Repository: togethercomputer/OpenChatKit
Branch: main
Commit: a7094aa583d4
Files: 83
Total size: 440.4 KB

Directory structure:
gitextract_ib4k5pq4/

├── .github/
│   └── ISSUE_TEMPLATE/
│       ├── bug_report.md
│       ├── feature_request.md
│       └── openchatkit-feedback-report.yaml
├── .gitignore
├── LICENSE
├── README.md
├── data/
│   ├── OIG/
│   │   └── prepare.py
│   ├── OIG-chip2/
│   │   └── prepare.sh
│   ├── OIG-moderation/
│   │   └── prepare.py
│   ├── prepare_data.py
│   └── wikipedia-3sentence-level-retrieval-index/
│       └── prepare.py
├── docs/
│   ├── GPT-NeoXT-Chat-Base-20B.md
│   └── finetuning-RedPajama-3B.md
├── environment.yml
├── inference/
│   ├── README.md
│   ├── bot.py
│   └── conversation.py
├── pretrained/
│   ├── GPT-NeoX-20B/
│   │   └── prepare.py
│   ├── Llama-2-7B-32K-beta/
│   │   └── prepare.py
│   ├── Pythia-6.9B-deduped/
│   │   └── prepare.py
│   ├── RedPajama-3B/
│   │   └── prepare.py
│   ├── RedPajama-7B/
│   │   └── prepare.py
│   └── prepare_pretrained.py
├── retrieval/
│   ├── README.md
│   ├── __init__.py
│   └── wikipedia.py
├── tools/
│   ├── README.md
│   ├── benchmark_input.json
│   ├── convert_to_hf_gptneox.py
│   ├── convert_to_hf_llama.py
│   └── model_load_benchmark.py
└── training/
    ├── README.md
    ├── comm/
    │   ├── __init__.py
    │   ├── comm_utils.py
    │   ├── nccl_backend.py
    │   └── torch_backend.py
    ├── data_parallel/
    │   ├── __init__.py
    │   ├── dist_dp_allreduce.py
    │   ├── dist_dp_central_ps.py
    │   ├── dist_dp_local.py
    │   ├── dist_dp_sharded_ps.py
    │   ├── dist_dp_utils.py
    │   └── flatten_utils.py
    ├── dist_clm_train.py
    ├── dist_prefixlm_train.py
    ├── finetune_GPT-NeoXT-Chat-Base-20B.sh
    ├── finetune_Pythia-Chat-Base-7B.sh
    ├── finetune_RedPajama-INCITE-7B-Chat.sh
    ├── finetune_RedPajama-INCITE-Chat-3B-v1.sh
    ├── finetune_llama-2-7b-32k-booksum.sh
    ├── finetune_llama-2-7b-32k-mqa.sh
    ├── lora/
    │   └── example/
    │       ├── redpajama-incite-chat-3b.py
    │       └── redpajama-incite-chat-3b_inference.py
    ├── modules/
    │   ├── __init__.py
    │   ├── deberta_modules.py
    │   ├── dist_deberta_pp_module.py
    │   ├── dist_gpt_fsdp_module.py
    │   ├── dist_gpt_pp_module.py
    │   ├── hf_gpt2_modules.py
    │   ├── hf_gptj_modules.py
    │   ├── hf_gptneox_modules.py
    │   ├── hf_opt_modules.py
    │   ├── llama_modules.py
    │   ├── task_modules.py
    │   ├── tokenizer.py
    │   └── utils.py
    ├── optimizer/
    │   ├── __init__.py
    │   ├── grad_scalar.py
    │   └── optimizer.py
    ├── pipeline_parallel/
    │   ├── __init__.py
    │   ├── dist_gpipe_pipeline_async.py
    │   └── dist_pp_utils.py
    ├── tasks/
    │   ├── __init__.py
    │   └── data_loaders/
    │       ├── __init__.py
    │       ├── data_utils.py
    │       └── prosocial.py
    └── utils/
        ├── __init__.py
        ├── dist_args_utils.py
        ├── dist_checkpoint_utils.py
        ├── dist_debug_utils.py
        ├── event_report.py
        ├── logging_utils.py
        └── upload_manager.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/ISSUE_TEMPLATE/bug_report.md
================================================
---
name: Bug report
about: Create a report to help us improve
title: ''
labels: ''
assignees: ''

---

**Describe the bug**
A clear and concise description of what the bug is.

**To Reproduce**
Steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error

**Expected behavior**
A clear and concise description of what you expected to happen.

**Screenshots**
If applicable, add screenshots to help explain your problem.

**Desktop (please complete the following information):**
 - OS: [e.g. iOS]
 - Browser [e.g. chrome, safari]
 - Version [e.g. 22]

**Smartphone (please complete the following information):**
 - Device: [e.g. iPhone6]
 - OS: [e.g. iOS8.1]
 - Browser [e.g. stock browser, safari]
 - Version [e.g. 22]

**Additional context**
Add any other context about the problem here.


================================================
FILE: .github/ISSUE_TEMPLATE/feature_request.md
================================================
---
name: Feature request
about: Suggest an idea for this project
title: ''
labels: ''
assignees: ''

---

**Is your feature request related to a problem? Please describe.**
A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

**Describe the solution you'd like**
A clear and concise description of what you want to happen.

**Describe alternatives you've considered**
A clear and concise description of any alternative solutions or features you've considered.

**Additional context**
Add any other context or screenshots about the feature request here.


================================================
FILE: .github/ISSUE_TEMPLATE/openchatkit-feedback-report.yaml
================================================
name: OpenChatKit Feedback Report
description: Details of feedback from using OpenChatKit test app
title: OpenChatKit Feedback Report
labels: "feedback report"
assignees: []
body:
  - type: markdown
    attributes:
      value: |
        Thanks for taking the time to fill out this feedback report!
  - type: textarea
    id: my-question
    attributes:
      label: "My question:"
    validations:
      required: true
  - type: textarea
    id: bot-response
    attributes:
      label: "Bot response:"
    validations:
      required: true
  - type: textarea
    id: ideal-bot-response
    attributes:
      label: "Ideal bot response:"
    validations:
      required: true
  - type: checkboxes
    id: response-issues
    attributes:
      label: "Bot response was:"
      options:
        - label: Factually incorrect
          required: true
        - label: Not helpful
          required: true
        - label: Harmful, inappropriate or unsafe
          required: true


================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# ignore downloaded files
/data/OIG-moderation/files/
/data/OIG/files/
/data/wikipedia-3sentence-level-retrieval-index/files/
/pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b/
/pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped/
/pretrained/RedPajama-3B/togethercomputer_RedPajama-INCITE-Chat-3B-v1

# ignore training output
/model_ckpts/
/huggingface_models/
/training/wandb/

# ignore trained low-rank adapters
/outputs/
data/OIG-chip2/*.jsonl
wandb/

================================================
FILE: LICENSE
================================================
                                 Apache License
                           Version 2.0, January 2004
                        http://www.apache.org/licenses/

   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION

   1. Definitions.

      "License" shall mean the terms and conditions for use, reproduction,
      and distribution as defined by Sections 1 through 9 of this document.

      "Licensor" shall mean the copyright owner or entity authorized by
      the copyright owner that is granting the License.

      "Legal Entity" shall mean the union of the acting entity and all
      other entities that control, are controlled by, or are under common
      control with that entity. For the purposes of this definition,
      "control" means (i) the power, direct or indirect, to cause the
      direction or management of such entity, whether by contract or
      otherwise, or (ii) ownership of fifty percent (50%) or more of the
      outstanding shares, or (iii) beneficial ownership of such entity.

      "You" (or "Your") shall mean an individual or Legal Entity
      exercising permissions granted by this License.

      "Source" form shall mean the preferred form for making modifications,
      including but not limited to software source code, documentation
      source, and configuration files.

      "Object" form shall mean any form resulting from mechanical
      transformation or translation of a Source form, including but
      not limited to compiled object code, generated documentation,
      and conversions to other media types.

      "Work" shall mean the work of authorship, whether in Source or
      Object form, made available under the License, as indicated by a
      copyright notice that is included in or attached to the work
      (an example is provided in the Appendix below).

      "Derivative Works" shall mean any work, whether in Source or Object
      form, that is based on (or derived from) the Work and for which the
      editorial revisions, annotations, elaborations, or other modifications
      represent, as a whole, an original work of authorship. For the purposes
      of this License, Derivative Works shall not include works that remain
      separable from, or merely link (or bind by name) to the interfaces of,
      the Work and Derivative Works thereof.

      "Contribution" shall mean any work of authorship, including
      the original version of the Work and any modifications or additions
      to that Work or Derivative Works thereof, that is intentionally
      submitted to Licensor for inclusion in the Work by the copyright owner
      or by an individual or Legal Entity authorized to submit on behalf of
      the copyright owner. For the purposes of this definition, "submitted"
      means any form of electronic, verbal, or written communication sent
      to the Licensor or its representatives, including but not limited to
      communication on electronic mailing lists, source code control systems,
      and issue tracking systems that are managed by, or on behalf of, the
      Licensor for the purpose of discussing and improving the Work, but
      excluding communication that is conspicuously marked or otherwise
      designated in writing by the copyright owner as "Not a Contribution."

      "Contributor" shall mean Licensor and any individual or Legal Entity
      on behalf of whom a Contribution has been received by Licensor and
      subsequently incorporated within the Work.

   2. Grant of Copyright License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      copyright license to reproduce, prepare Derivative Works of,
      publicly display, publicly perform, sublicense, and distribute the
      Work and such Derivative Works in Source or Object form.

   3. Grant of Patent License. Subject to the terms and conditions of
      this License, each Contributor hereby grants to You a perpetual,
      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
      (except as stated in this section) patent license to make, have made,
      use, offer to sell, sell, import, and otherwise transfer the Work,
      where such license applies only to those patent claims licensable
      by such Contributor that are necessarily infringed by their
      Contribution(s) alone or by combination of their Contribution(s)
      with the Work to which such Contribution(s) was submitted. If You
      institute patent litigation against any entity (including a
      cross-claim or counterclaim in a lawsuit) alleging that the Work
      or a Contribution incorporated within the Work constitutes direct
      or contributory patent infringement, then any patent licenses
      granted to You under this License for that Work shall terminate
      as of the date such litigation is filed.

   4. Redistribution. You may reproduce and distribute copies of the
      Work or Derivative Works thereof in any medium, with or without
      modifications, and in Source or Object form, provided that You
      meet the following conditions:

      (a) You must give any other recipients of the Work or
          Derivative Works a copy of this License; and

      (b) You must cause any modified files to carry prominent notices
          stating that You changed the files; and

      (c) You must retain, in the Source form of any Derivative Works
          that You distribute, all copyright, patent, trademark, and
          attribution notices from the Source form of the Work,
          excluding those notices that do not pertain to any part of
          the Derivative Works; and

      (d) If the Work includes a "NOTICE" text file as part of its
          distribution, then any Derivative Works that You distribute must
          include a readable copy of the attribution notices contained
          within such NOTICE file, excluding those notices that do not
          pertain to any part of the Derivative Works, in at least one
          of the following places: within a NOTICE text file distributed
          as part of the Derivative Works; within the Source form or
          documentation, if provided along with the Derivative Works; or,
          within a display generated by the Derivative Works, if and
          wherever such third-party notices normally appear. The contents
          of the NOTICE file are for informational purposes only and
          do not modify the License. You may add Your own attribution
          notices within Derivative Works that You distribute, alongside
          or as an addendum to the NOTICE text from the Work, provided
          that such additional attribution notices cannot be construed
          as modifying the License.

      You may add Your own copyright statement to Your modifications and
      may provide additional or different license terms and conditions
      for use, reproduction, or distribution of Your modifications, or
      for any such Derivative Works as a whole, provided Your use,
      reproduction, and distribution of the Work otherwise complies with
      the conditions stated in this License.

   5. Submission of Contributions. Unless You explicitly state otherwise,
      any Contribution intentionally submitted for inclusion in the Work
      by You to the Licensor shall be under the terms and conditions of
      this License, without any additional terms or conditions.
      Notwithstanding the above, nothing herein shall supersede or modify
      the terms of any separate license agreement you may have executed
      with Licensor regarding such Contributions.

   6. Trademarks. This License does not grant permission to use the trade
      names, trademarks, service marks, or product names of the Licensor,
      except as required for reasonable and customary use in describing the
      origin of the Work and reproducing the content of the NOTICE file.

   7. Disclaimer of Warranty. Unless required by applicable law or
      agreed to in writing, Licensor provides the Work (and each
      Contributor provides its Contributions) on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
      implied, including, without limitation, any warranties or conditions
      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
      PARTICULAR PURPOSE. You are solely responsible for determining the
      appropriateness of using or redistributing the Work and assume any
      risks associated with Your exercise of permissions under this License.

   8. Limitation of Liability. In no event and under no legal theory,
      whether in tort (including negligence), contract, or otherwise,
      unless required by applicable law (such as deliberate and grossly
      negligent acts) or agreed to in writing, shall any Contributor be
      liable to You for damages, including any direct, indirect, special,
      incidental, or consequential damages of any character arising as a
      result of this License or out of the use or inability to use the
      Work (including but not limited to damages for loss of goodwill,
      work stoppage, computer failure or malfunction, or any and all
      other commercial damages or losses), even if such Contributor
      has been advised of the possibility of such damages.

   9. Accepting Warranty or Additional Liability. While redistributing
      the Work or Derivative Works thereof, You may choose to offer,
      and charge a fee for, acceptance of support, warranty, indemnity,
      or other liability obligations and/or rights consistent with this
      License. However, in accepting such obligations, You may act only
      on Your own behalf and on Your sole responsibility, not on behalf
      of any other Contributor, and only if You agree to indemnify,
      defend, and hold each Contributor harmless for any liability
      incurred by, or claims asserted against, such Contributor by reason
      of your accepting any such warranty or additional liability.

   END OF TERMS AND CONDITIONS

   APPENDIX: How to apply the Apache License to your work.

      To apply the Apache License to your work, attach the following
      boilerplate notice, with the fields enclosed by brackets "[]"
      replaced with your own identifying information. (Don't include
      the brackets!)  The text should be enclosed in the appropriate
      comment syntax for the file format. We also recommend that a
      file or class name and description of purpose be included on the
      same "printed page" as the copyright notice for easier
      identification within third-party archives.

   Copyright [yyyy] [name of copyright owner]

   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at

       http://www.apache.org/licenses/LICENSE-2.0

   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License.

------------- LICENSE for training code -------------

Copyright (c) 2022 Anonymous Institution 

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# OpenChatKit

OpenChatKit provides a powerful, open-source base to create both specialized and general purpose models for various applications. The kit includes instruction-tuned language models, a moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. OpenChatKit models were trained on the OIG-43M training dataset, which was a collaboration between [Together](https://www.together.xyz/), [LAION](https://laion.ai), and [Ontocord.ai](https://ontocord.ai).

In this repo, you'll find code for:
- Training GPT-NeoXT-Chat-Base-20B, a 20B parameter chat model (see [docs/GPT-NeoXT-Chat-Base-20B.md](docs/GPT-NeoXT-Chat-Base-20B.md))
- Fine-tuning Llama-2-7B-32K-beta, a 7B parameter long context model
- Training Pythia-Chat-Base-7B, a 7B parameter chat model
- Testing inference using either of the chat models
- Augmenting the model with additional context from a retrieval index

# Contents

- [Getting Started](#getting-started)
  * [Requirements](#requirements)
  * [Chatting with Pythia-Chat-Base-7B](#chatting-with-pythia-chat-base-7b)
- [Fine-tuning Llama-2-7B-32K-beta](#fine-tuning-llama-2-7b-32k-beta)
  * [Downloading and converting the base model](#downloading-and-converting-the-base-model)
  * [Fine-tuning the model](#fine-tuning-the-model)
  * [Converting trained weights to Hugging Face format](#converting-trained-weights-to-hugging-face-format)
- [Reproducing Pythia-Chat-Base-7B](#reproducing-pythia-chat-base-7b)
  * [Downloading training data and the base model](#downloading-training-data-and-the-base-model)
  * [(Optional) 8bit Adam](#optional-8bit-adam)
  * [Training the model](#training-the-model)
  * [Converting weights to Hugging Face format](#converting-weights-to-hugging-face-format)
  * [Testing the new model](#testing-the-new-model)
- [Monitoring](#monitoring)
  * [Loguru](#loguru)
  * [Weights & Biases](#weights--biases)
- [Experimental: Retrieval-Augmented Models](#experimental-retrieval-augmented-models)
- [See Also](#see-also)
- [License](#license)
- [Citing OpenChatKit](#citing-openchatkit)
- [Acknowledgements](#acknowledgements)

# Getting Started

In this tutorial, you will download Pythia-Chat-Base-7B, an instruction-tuned language model, and run some inference requests against it using a command-line tool.

Pythia-Chat-Base-7B is a 7B-parameter fine-tuned variant of Pythia-6.9B-deduped from Eleuther AI. Pre-trained weights for this model are available on Hugging Face as [togethercomputer/Pythia-Chat-Base-7B](https://huggingface.co/togethercomputer/Pythia-Chat-Base-7B) under an Apache 2.0 license.

More details can be found on the model card for [Pythia-Chat-Base-7B](https://huggingface.co/togethercomputer/Pythia-Chat-Base-7B) on Hugging Face.

## Requirements

Before you begin, you need to install PyTorch and other dependencies.

1. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) from their website.

2. Install [Git LFS](https://git-lfs.com/) from their website.

3. Install the `git lfs` hooks.

```shell
git lfs install
```

4. Install mamba in the `base` environment so it's available in all environments.

```shell
conda install mamba -n base -c conda-forge
```

5. Create an environment called OpenChatKit using the `environment.yml` file at the root of this repo.

> **Note**
> Use `mamba` to create the environment. It's **much** faster than using `conda`.

```shell
mamba env create -f environment.yml 
```

6. Activate the new conda environment.

```shell
conda activate OpenChatKit
```

## Chatting with Pythia-Chat-Base-7B

To help you try the model, [`inference/bot.py`](inference/bot.py) is a simple command-line test harness that provides a shell interface enabling you to chat with the model. Simply enter text at the prompt and the model replies. The test harness also maintains conversation history to provide the model with context.


Start the bot by calling `bot.py` from the root of the repo.

```shell
python inference/bot.py --model togethercomputer/Pythia-Chat-Base-7B
```

Loading the model can take some time, but once it's loaded, you are greeted with a prompt. Say hello.

```shell
$ python inference/bot.py 
Loading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...
Welcome to OpenChatKit shell.   Type /help or /? to list commands.

>>> Hello.
Hello human.

>>> 
```

Enter additional queries at the prompt, and the model replies. Under the covers, the shell forms a prompt from all previous queries and passes it to the model to generate more text.
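
The history mechanism described above can be sketched in a few lines of Python. This is a minimal illustration, not the code in `inference/conversation.py`: the `Conversation` class, its method names, and the `<human>`/`<bot>` speaker tags are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Accumulates chat turns and renders them as a single prompt string."""

    # Speaker tags are assumptions for illustration, not taken from this README.
    human_tag: str = "<human>"
    bot_tag: str = "<bot>"
    turns: list = field(default_factory=list)

    def push_human(self, text: str) -> None:
        """Record a query typed by the user."""
        self.turns.append((self.human_tag, text.strip()))

    def push_bot(self, text: str) -> None:
        """Record a reply generated by the model."""
        self.turns.append((self.bot_tag, text.strip()))

    def prompt(self) -> str:
        """Render every previous turn, ending with an open bot turn to complete."""
        lines = [f"{tag}: {text}" for tag, text in self.turns]
        lines.append(f"{self.bot_tag}:")
        return "\n".join(lines)


convo = Conversation()
convo.push_human("Hello.")
convo.push_bot("Hello human.")
convo.push_human("Where is Zurich?")
print(convo.prompt())
```

Because the whole history is re-sent on every turn, the prompt grows with the conversation, so a real harness also has to keep it within the model's context length.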

The shell also supports additional commands to inspect hyperparameters, the full prompt, and more. Commands are prefixed with a `/`.

> **Note**
> The `/quit` command exits the shell.

Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.

# Fine-tuning Llama-2-7B-32K-beta

The Llama-2-7B-32K-beta model can be fine-tuned using various datasets. In this tutorial, we will use the multi-document natural questions dataset and the BookSum dataset.

## Downloading and converting the base model

To download the Llama-2-7B-32K-beta model and prepare it for fine-tuning, run this command from the root of the repository.

```shell
python pretrained/Llama-2-7B-32K-beta/prepare.py
```

The weights for this model will be in the `pretrained/Llama-2-7B-32K-beta/togethercomputer_Llama-2-7B-32K-beta` directory.


## Fine-tuning the model

The `training/finetune_llama-2-7b-32k-mqa.sh` and `training/finetune_llama-2-7b-32k-booksum.sh` scripts configure and run the training loop.

1. To fine-tune on the multi-document natural questions dataset, run:
   ```shell
   bash training/finetune_llama-2-7b-32k-mqa.sh
   ```

2. To fine-tune on the BookSum dataset, run:
   ```shell
   bash training/finetune_llama-2-7b-32k-booksum.sh
   ```

As the training loop runs, checkpoints are saved to the `model_ckpts` directory at the root of the repo.

Please see [the training README](training/README.md) for more details about customizing the training run.

## Converting trained weights to Hugging Face format

Before you can use this model to perform inference, it must be converted to the Hugging Face format. Run this command from the root of the repo to do so.

For example:
```shell
mkdir huggingface_models \
  && python tools/convert_to_hf_llama.py \
       --config-name togethercomputer/Llama-2-7B-32K-beta \
       --ckpt-path model_ckpts/llama-2-7b-32k-mqa/checkpoint_10 \
       --save-path huggingface_models/llama-2-7b-32k-mqa \
       --n-stages 4 \
       --n-layer-per-stage 8 \
       --fp16
```
where the `--fp16` flag will load and store models in fp16.

Make sure to replace `model_ckpts/llama-2-7b-32k-mqa/checkpoint_10` with the latest checkpoint in the `model_ckpts/llama-2-7b-32k-mqa` or `model_ckpts/llama-2-7b-32k-booksum` directory.
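
Picking the latest checkpoint by eye is error-prone once step numbers have different digit counts, because `checkpoint_100` sorts before `checkpoint_20` alphabetically. The helper below is a small sketch that compares step numbers numerically. It assumes checkpoints are subdirectories named `checkpoint_<step>`; the function name `latest_checkpoint` is hypothetical and not part of this repo.

```python
import re
from pathlib import Path
from typing import Optional


def latest_checkpoint(ckpt_root: Path) -> Optional[Path]:
    """Return the checkpoint_<step> subdirectory with the highest step, if any."""
    pattern = re.compile(r"^checkpoint_(\d+)$")
    best_step, best_path = -1, None
    if not ckpt_root.is_dir():
        return None
    for child in ckpt_root.iterdir():
        match = pattern.match(child.name)
        # Compare steps as integers so that 100 ranks above 20.
        if match and child.is_dir() and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Prints None when the directory does not exist yet.
print(latest_checkpoint(Path("model_ckpts/llama-2-7b-32k-mqa")))
```

The returned path can then be passed as the `--ckpt-path` argument of the conversion command.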


# Reproducing Pythia-Chat-Base-7B

This tutorial walks through reproducing the Pythia-Chat-Base-7B model by fine-tuning Eleuther AI's Pythia-6.9B-deduped model using the OIG dataset.

## Downloading training data and the base model

The chat model was trained on the [OIG](https://huggingface.co/datasets/laion/OIG) dataset built by [LAION](https://laion.ai/), [Together](https://www.together.xyz/), and [Ontocord.ai](https://www.ontocord.ai/). To download the dataset from Hugging Face run the command below from the root of the repo.

```shell
python data/OIG/prepare.py
```
> **Note** 
> You can help make this chat model better by contributing data! See the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repo for more details.

Once the command completes, the data will be in the `data/OIG/files` directory.

Pythia-Chat-Base-7B is a fine-tuned variant of Pythia-6.9B-deduped from Eleuther AI. To download the model and prepare it for fine-tuning, run this command from the root of the repo.

```shell
python pretrained/Pythia-6.9B-deduped/prepare.py
```

The weights for this model will be in the `pretrained/Pythia-6.9B-deduped/EleutherAI_pythia-6.9b-deduped` directory.

## (Optional) 8bit Adam

To use 8bit-adam during training, install the `bitsandbytes` package.

```shell
pip install bitsandbytes # optional, to use 8bit-adam
```

## Training the model

The `training/finetune_Pythia-Chat-Base-7B.sh` script configures and runs the training loop. After downloading the dataset and the base model, run:

```shell
bash training/finetune_Pythia-Chat-Base-7B.sh
```

As the training loop runs, checkpoints are saved to the `model_ckpts` directory at the root of the repo.

Please see [the training README](training/README.md) for more details about customizing the training run.

## Converting weights to Hugging Face format

Before you can use this model to perform inference, it must be converted to the Hugging Face format. Run this command from the root of the repo to do so.

```shell
mkdir huggingface_models \
  && python tools/convert_to_hf_gptneox.py \
       --config-name EleutherAI/pythia-6.9b-deduped \
       --ckpt-path model_ckpts/Pythia-Chat-Base-7B/checkpoint_100 \
       --save-path huggingface_models/Pythia-Chat-Base-7B \
       --n-stages 4 \
       --n-layer-per-stage 8 \
       --fp16
```
where the `--fp16` flag will load and store models in fp16.

Make sure to replace `model_ckpts/Pythia-Chat-Base-7B/checkpoint_100` with the latest checkpoint in the `model_ckpts/Pythia-Chat-Base-7B` directory.
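
In the command above, `--n-stages 4` and `--n-layer-per-stage 8` multiply out to 32 transformer layers. The sketch below checks that arithmetic and shows which layers each pipeline stage would hold. It rests on two assumptions that this README does not state: that the base model has 32 hidden layers, and that stages hold contiguous blocks of layers. The function `stage_layout` is hypothetical.

```python
def stage_layout(n_stages: int, n_layer_per_stage: int, num_hidden_layers: int) -> dict:
    """Map each pipeline stage to the global layer indices it would hold.

    Assumes contiguous partitioning: stage i holds layers
    [i * n_layer_per_stage, (i + 1) * n_layer_per_stage).
    """
    if n_stages * n_layer_per_stage != num_hidden_layers:
        raise ValueError(
            f"{n_stages} stages x {n_layer_per_stage} layers per stage "
            f"does not equal {num_hidden_layers} layers"
        )
    return {
        stage: list(range(stage * n_layer_per_stage, (stage + 1) * n_layer_per_stage))
        for stage in range(n_stages)
    }


# 32 layers is an assumed value for the base model configuration.
layout = stage_layout(n_stages=4, n_layer_per_stage=8, num_hidden_layers=32)
for stage, layers in layout.items():
    print(f"stage {stage}: layers {layers[0]}-{layers[-1]}")
```

If training ran with a different pipeline split, the two flags would presumably need to change together so that their product still matches the model's layer count.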

## Testing the new model

You can use the OpenChatKit Shell test harness to chat with the new model. From the root of the repo, run

```shell
python inference/bot.py
```

By default the script will load the model named Pythia-Chat-Base-7B under the `huggingface_models` directory, but you can override that behavior by specifying `--model`.

```shell
python inference/bot.py --model ./huggingface_models/GPT-NeoXT-Chat-Base-20B
```

Once the model has loaded, enter text at the prompt and the model will reply.

```shell
$ python inference/bot.py 
Loading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...
Welcome to OpenChatKit shell.   Type /help or /? to list commands.

>>> Hello.
Hello human.

>>> 
```

The shell also supports additional commands to inspect hyperparameters, the full prompt, and more. Commands are prefixed with a `/`.

> **Note**
> The `/quit` command exits the shell.

Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.

# Monitoring

By default, the training script simply prints the loss as training proceeds, but it can also output metrics to a file using [loguru](https://github.com/Delgan/loguru) or report them to Weights & Biases.

## Loguru

Add the flag `--train-log-backend loguru` to your training script to log to `./logs/file_{time}.log`.

## Weights & Biases

To use Weights & Biases, first login with your Weights & Biases token.

```shell
wandb login
```

And set `--train-log-backend wandb` in the training script to enable logging to Weights & Biases.

# Experimental: Retrieval-Augmented Models

> **Warning**
> Retrieval support is experimental.

The code in `/retrieval` implements a python package for querying a Faiss index of Wikipedia. The following steps explain how to use this index to augment queries in the test harness with context from the retriever.

1. Download the Wikipedia index.

```shell
python data/wikipedia-3sentence-level-retrieval-index/prepare.py
```

2. Run the bot with the `--retrieval` flag.

```shell
python inference/bot.py --retrieval
```

After starting, the bot will load both the chat model and the retrieval index, which takes a long time. Once the model and the index are loaded, all queries will be augmented with extra context.


```shell
$ python inference/bot.py --retrieval
Loading /OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...
Loading retrieval index...
Welcome to OpenChatKit shell.   Type /help or /? to list commands.

>>> Where is Zurich?
Where is Zurich?
Zurich is located in Switzerland.

>>>
```
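
The exact prompt format that `bot.py` uses for retrieval is not described here. As a rough illustration of what "augmenting a query with context" can mean, the sketch below simply prepends retrieved passages to the user's query. The function `augment_query` and the sample passage are hypothetical, not the repository's implementation.

```python
from typing import List


def augment_query(query: str, passages: List[str], max_passages: int = 2) -> str:
    """Prepend up to max_passages retrieved passages to the query as context."""
    # Drop empty passages and surrounding whitespace before joining.
    context = "\n".join(p.strip() for p in passages[:max_passages] if p.strip())
    return f"{context}\n\n{query}" if context else query


if __name__ == "__main__":
    # Hypothetical passage standing in for a result from the Wikipedia index.
    print(augment_query("Where is Zurich?", ["Zurich is a city in Switzerland."]))
```

Capping the number of passages keeps the prompt short, which matters because the context and the query share the model's input window.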

# See Also
* [docs/GPT-NeoXT-Chat-Base-20B.md](docs/GPT-NeoXT-Chat-Base-20B.md). OpenChatKit also provides a larger, 20B parameter chat model, GPT-NeoXT-Chat-Base-20B, that was fine-tuned from GPT-NeoX-20B from Eleuther AI.

# License

All code in this repository was developed by Together Computer except where otherwise noted.  Copyright (c) 2023, Together Computer.  All rights reserved. The code is licensed under the Apache 2.0 license.


```
Copyright 2023 Together Computer

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

   http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
```

This repository also contains code written by a number of other authors. Such contributions are marked and the relevant licensing is included where appropriate.

For full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing please [contact us](https://www.together.xyz/contact).

# Citing OpenChatKit

```bibtex
@software{openchatkit,
  title = {{OpenChatKit: An Open Toolkit and Base Model for Dialogue-style Applications}},
  author = {Together Computer},
  url = {https://github.com/togethercomputer/OpenChatKit},
  month = {3},
  year = {2023},
  version = {0.15},
}
```

# Acknowledgements

Our models are fine-tuned versions of large language models trained by [Eleuther AI](https://www.eleuther.ai). We evaluated our model on [HELM](https://crfm.stanford.edu/helm/latest/) provided by the [Center for Research on Foundation Models](https://crfm.stanford.edu). And we collaborated with both [CRFM](https://crfm.stanford.edu) and [HazyResearch](http://hazyresearch.stanford.edu) at Stanford to build this model.

We collaborated with [LAION](https://laion.ai/) and [Ontocord.ai](https://www.ontocord.ai/) to build the training data used to fine tune this model.


================================================
FILE: data/OIG/prepare.py
================================================
import sys
import os

# Import the prepare_data function
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(current_dir, '..'))
from prepare_data import prepare_data

if __name__ == "__main__":
    dest_dir = os.path.join(current_dir, "files")
    prepare_data("https://huggingface.co/datasets/laion/OIG", dest_dir)


================================================
FILE: data/OIG-chip2/prepare.sh
================================================
DIR=$(cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)
wget https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl -O ${DIR}/unified_chip2.jsonl

================================================
FILE: data/OIG-moderation/prepare.py
================================================
import sys
import os

# Import the prepare_data function
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(current_dir, '..'))
from prepare_data import prepare_data

if __name__ == "__main__":
    dest_dir = os.path.join(current_dir, "files")
    prepare_data("https://huggingface.co/datasets/ontocord/OIG-moderation", dest_dir)


================================================
FILE: data/prepare_data.py
================================================
import argparse
from shutil import copyfile
import boto3
import botocore
import glob
import gzip
import os
import re
import requests
import shutil
import subprocess
import sys
from urllib.parse import urlparse


# Check if git-lfs is installed.
def is_git_lfs_installed():
    try:
        process = subprocess.run(['git', 'lfs', 'version'], 
                                 stdout=subprocess.DEVNULL, 
                                 stderr=subprocess.DEVNULL)
        return process.returncode == 0
    except FileNotFoundError:
        return False

# Check if a url is a Hugging Face git URL.
def is_huggingface_git_url(url):
    # Regular expression pattern for Hugging Face git URLs
    hf_git_pattern = r'^https://huggingface\.co/datasets/[A-Za-z0-9_\.\-/]+$'
    
    # Match the pattern against the URL
    # Return True if a match is found, False otherwise
    return re.match(hf_git_pattern, url) is not None

# Check if the path is a GitHub repository URL.
def is_github_repo_url(url):
    # Regular expression patterns for GitHub repository URLs
    ssh_pattern = r'^git@github\.com:[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+\.git$'
    http_pattern = r'^https?://(www\.)?github\.com/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+\.git$'
    
    # Match the patterns against the path
    # Return True if a match is found in either SSH or HTTP pattern, False otherwise
    return re.match(ssh_pattern, url) is not None or re.match(http_pattern, url) is not None


# Check if the path is an S3 or R2 repository URL.
def is_s3_url(url):
    # Regular expression pattern for S3 URLs
    s3_pattern = r'^https?://(s3(-[a-z0-9-]+)?\.amazonaws|[a-fA-F0-9]+\.r2\.cloudflarestorage)\.com/[a-z0-9][a-z0-9\.\-]{1,61}[a-z0-9]/[0-9a-zA-Z!\-_\.*\'()/]+$'
    
    # Match the pattern against the URL
    # Return True if a match is found, False otherwise
    if re.match(s3_pattern, url) is None:
        return False
    
    # Check for a valid bucket name
    bucket_name = url.split('/')[3]
    if bucket_name.startswith("xn--"):
        return False
    if bucket_name.endswith("-s3alias"):
        return False
    if bucket_name.endswith("--ol-s3"):
        return False
    if re.search(r'\.\.', bucket_name) is not None:
        return False
    if re.match(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', bucket_name) is not None:
        return False
    if re.match(r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}', bucket_name) is not None:
        return False
    
    return True


# Clone a git repository into destination_dir. First check that the git-lfs
# hooks are configured. If they are not and git-lfs is available, run
# `git lfs install`. If git-lfs is unavailable, print an error message and exit.
def clone_git_repo(data_source, destination_dir):
    process = subprocess.run(
        'git lfs env | grep -q \'git config filter.lfs.smudge = "git-lfs smudge -- %f"\'',
        shell=True
    )

    # Check if the git repository has already been cloned
    if os.path.exists(os.path.join(destination_dir, ".git")):
        print(f"Git repository already exists at {destination_dir}. Skipping clone.")
        return

    # If the git-lfs hooks are not configured but git-lfs is available,
    # install the hooks now.
    if process.returncode != 0 and is_git_lfs_installed():
        process = subprocess.run('git lfs install', shell=True)

    if process.returncode != 0:
        print('error: git lfs not installed. please install git-lfs and run `git lfs install`')
        sys.exit(1)

    # Clone a GitHub repository.
    try:
        subprocess.run(f"git clone {data_source} {destination_dir}", shell=True,
                       check=True)
    except subprocess.CalledProcessError:
        print(f"error: failed to clone repository {data_source}")
        sys.exit(1)

    

# Download all files from an S3 compatible storage service.
def download_from_s3(url, destination_dir, access_key_id = None,
                     secret_access_key = None, session_token = None, debug = False):
    # Get the access key ID and secret access key from the environment variables
    if access_key_id is None:
        access_key_id = os.environ.get('AWS_ACCESS_KEY_ID')
    if secret_access_key is None:
        secret_access_key = os.environ.get('AWS_SECRET_ACCESS_KEY')
    if session_token is None:
        session_token = os.environ.get('AWS_SESSION_TOKEN')
    
    # Create an S3 client
    parsed_url = url.split('/')
    endpoint_url = f"{parsed_url[0]}//{parsed_url[2]}"
    bucket_name = parsed_url[3]
    key_prefix = "/".join(parsed_url[4:-1])
    base_file = parsed_url[-1] if not url.endswith('/') else ""
    
    print(f"endpoint_url={endpoint_url} ...")
    if debug:
        print(f"access_key_id={access_key_id}")
        print(f"secret_access_key={secret_access_key}")
        print(f"bucket_name={bucket_name}")
        print(f"key_prefix={key_prefix}")
        print(f"base_file={base_file}")

    s3 = boto3.resource('s3',
        endpoint_url = endpoint_url,
        aws_access_key_id = access_key_id,
        aws_secret_access_key = secret_access_key,
        aws_session_token=session_token,
        region_name = "auto"
    )
    
    # Create the destination directory if it does not exist
    os.makedirs(destination_dir, exist_ok=True)

    try:
        print(f"Downloading file(s) from S3 {url} to {destination_dir} ...")
        bucket = s3.Bucket(bucket_name)
        
        # A trailing '/' means the URL names a prefix: download every object under it
        if url.endswith('/'):
            # Skip directory markers and files that already exist locally
            for obj in bucket.objects.filter(Prefix=key_prefix):
                if not obj.key.endswith('/'):
                    destination_file = os.path.join(destination_dir, os.path.basename(obj.key))
                    if not os.path.exists(destination_file):
                        print(f"Downloading {obj.key} ...")
                        bucket.download_file(obj.key, destination_file)
                    else:
                        print(f"File already exists, skipping {obj.key}")
        else:
            destination_file = os.path.join(destination_dir, base_file)
            if not os.path.exists(destination_file):
                print(f"Downloading {base_file} ...")
                bucket.download_file(f'{key_prefix}/{base_file}' if key_prefix else base_file, destination_file)
            else:
                print(f"File already exists, skipping {base_file}")

        print("Download completed successfully.")
        return
    
    except botocore.exceptions.NoCredentialsError:
        print("Error: AWS credentials not found.") 
    except botocore.exceptions.EndpointConnectionError:
        print("Error: Unable to connect to the S3 endpoint.")
    except botocore.exceptions.ParamValidationError as e:
        print(f"Error: Invalid S3 URL: {e}")
    except botocore.exceptions.ClientError as e:
        print(f"Error: {e.response['Error']['Message']}")
    
    # Something went wrong, exit with error.
    sys.exit(1)

def download_from_url(url, destination_dir):
    print(f"Downloading file from {url} to {destination_dir} ...")
    try:
        # Parse the URL to extract the filename
        parsed_url = urlparse(url)
        filename = os.path.basename(parsed_url.path)
        
        # Construct the destination file path
        destination_file = os.path.join(destination_dir, filename)
        
        # Download the file
        response = requests.get(url, stream=True)
        response.raise_for_status()
        with open(destination_file, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192): 
                f.write(chunk)
        print("Download completed successfully.")
        return
    
    except requests.exceptions.HTTPError as e:
        print(f"Error: {e}")
    except requests.exceptions.ConnectionError:
        print("Error: Unable to connect to the URL.")
    except requests.exceptions.Timeout:
        print("Error: Connection timed out.")
    except requests.exceptions.RequestException as e:
        print(f"Error: {e}")

    # Something went wrong, exit with error.
    sys.exit(1)

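# prepare_data fetches data_source into destination_dir. The source may be a
# local file, a GitHub or Hugging Face git repository, an S3/R2 URL, or a plain
# HTTP(S) URL. Gzipped files in destination_dir are extracted afterwards.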
def prepare_data(data_source, destination_dir, access_key_id=None, secret_access_key=None, debug=False):

    # Check that destination_dir is a directory. If it does not exist, then
    # create it.
    if not os.path.exists(destination_dir):
        os.makedirs(destination_dir)
    elif not os.path.isdir(destination_dir):
        print(f"Error: {destination_dir} is not a directory.")
        sys.exit(1)

    if os.path.isfile(data_source):
        # Handle the case where the data source is a local file
        print(f"Copying file {data_source} to {destination_dir} ...")
        # shutil.copy accepts a directory as the destination; copyfile does not.
        shutil.copy(data_source, destination_dir)
    elif is_github_repo_url(data_source) or is_huggingface_git_url(data_source):
        # Handle the case where the data source is a GitHub or Hugging Face repository
        clone_git_repo(data_source, destination_dir)
    elif is_s3_url(data_source):
        # Handle the case where the data source is an S3 URL
        download_from_s3(url=data_source, destination_dir=destination_dir, access_key_id=access_key_id, 
                         secret_access_key=secret_access_key, debug=debug)
    elif data_source.startswith('http://') or data_source.startswith('https://'):
        # Handle the case where the data source is a URL
        download_from_url(data_source, destination_dir)
    else:
        print(f"Error: Invalid data source: {data_source}")
        sys.exit(1)

    # Extract gzipped files, if present
    for file_path in glob.glob(f"{destination_dir}/*.gz"):
        out_path, _ = os.path.splitext(file_path)
        with gzip.open(file_path, 'rb') as infile, open(out_path, 'wb') as outfile:
            shutil.copyfileobj(infile, outfile)
        os.remove(file_path)
    
def main():
    parser = argparse.ArgumentParser(description="Script for fetching data from a git repository, S3, a URL, or a local file and extracting gzipped files.")
    parser.add_argument("-s", "--data-source", required=True, help="Data source: git repository URL, S3/R2 URL, HTTP(S) URL, or local file path")
    parser.add_argument("-d", "--dest", required=True, help="Destination directory for the downloaded and extracted files")
    parser.add_argument("-a", "--access-key-id", required=False, help="AWS access key ID")
    parser.add_argument("-k", "--secret-access-key", required=False, help="AWS secret access key")
    parser.add_argument("--debug", action='store_true', help="Enable debug mode")

    args = parser.parse_args()
    prepare_data(args.data_source, args.dest, args.access_key_id, args.secret_access_key, args.debug)


if __name__ == "__main__":
    main()


================================================
FILE: data/wikipedia-3sentence-level-retrieval-index/prepare.py
================================================
import sys
import os

# Import the prepare_data function
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(current_dir, '..'))
from prepare_data import prepare_data

if __name__ == "__main__":
    dest_dir = os.path.join(current_dir, "files")
    prepare_data("https://huggingface.co/datasets/ChristophSchuhmann/wikipedia-3sentence-level-retrieval-index", dest_dir)


================================================
FILE: docs/GPT-NeoXT-Chat-Base-20B.md
================================================
# GPT-NeoXT-Chat-Base-20B

OpenChatKit includes an instruction-tuned 20 billion parameter language model called GPT-NeoXT-Chat-Base-20B, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between [Together](https://www.together.xyz/), [LAION](https://laion.ai), and [Ontocord.ai](https://ontocord.ai). Much more than a model release, this is the beginning of an open source project. We are releasing a set of tools and processes for ongoing improvement with community contributions. 

In this doc, you'll find steps for:
- Training an OpenChatKit model
- Testing inference using the model
- Augmenting the model with additional context from a retrieval index

# Contents

- [Requirements](#requirements)
- [Pre-trained Weights](#pre-trained-weights)
- [Datasets](#datasets)
  * [Data Contributions](#data-contributions)
- [Pretrained Base Model](#pretrained-base-model)
- [Training and Finetuning](#training-and-finetuning)
  * [(Optional) 8bit Adam](#optional-8bit-adam)
  * [Train GPT-NeoX-Chat-Base-20B](#train-gpt-neox-chat-base-20b)
- [Converting Weights to Huggingface Format](#converting-weights-to-huggingface-format)
- [Inference](#inference)
- [Monitoring](#monitoring)
  * [Loguru](#loguru)
  * [Weights & Biases](#weights--biases)
- [Experimental: Retrieval-Augmented Models](#experimental-retrieval-augmented-models)
- [Acknowledgements](#acknowledgements)

# Requirements

Before you begin, you need to install PyTorch and other dependencies.

1. Install [Miniconda](https://docs.conda.io/en/latest/miniconda.html) from their website.

2. Install [Git LFS](https://git-lfs.com/) from their website.

3. Install the `git lfs` hooks.

```shell
git lfs install
```

4. Install mamba in the `base` environment so it's available in all environments.

```shell
conda install mamba -n base -c conda-forge
```

5. Create an environment called OpenChatKit using the `environment.yml` file at the root of this repo.

```shell
mamba env create -f environment.yml 
```

6. Activate the new conda environment.

```shell
conda activate OpenChatKit
```

# Pre-trained Weights

GPT-NeoXT-Chat-Base-20B is a 20B-parameter variant of GPT-NeoX, fine-tuned on conversational datasets. We are releasing pre-trained weights for this model as [togethercomputer/GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.

More details can be found on the model card for [GPT-NeoXT-Chat-Base-20B](https://huggingface.co/togethercomputer/GPT-NeoXT-Chat-Base-20B) on Huggingface.

# Datasets

The chat model was trained on the [OIG](https://huggingface.co/datasets/laion/OIG) dataset built by [LAION](https://laion.ai/), [Together](https://www.together.xyz/), and [Ontocord.ai](https://www.ontocord.ai/). To download the dataset from Huggingface run the command below from the root of the repo.

```shell
python data/OIG/prepare.py
```

Once the command completes, the data will be in the `data/OIG/files` directory.

## Data Contributions

You can help make this chat model better by contributing data! See the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repo for more details.

# Pretrained Base Model

As mentioned above, the chat model is a fine-tuned variant of GPT-NeoX-20B from Eleuther AI. To download GPT-NeoX-20B and prepare it for fine tuning, run this command from the root of the repo.

```shell
python pretrained/GPT-NeoX-20B/prepare.py
```

The weights for this model will be in the `pretrained/GPT-NeoX-20B/EleutherAI_gpt-neox-20b` directory.

In case you want to fine-tune other gpt-neox models, e.g. [the Pythia model suite](https://huggingface.co/models?sort=downloads&search=pythia), you can specify the HF model name, for example:

```shell
python pretrained/GPT-NeoX-20B/prepare.py --model-name EleutherAI/pythia-6.9b-deduped
```

The weights for this model will be in the `pretrained/GPT-NeoX-20B/EleutherAI_pythia-6.9b-deduped` directory.


# Training and Finetuning

## (Optional) 8bit Adam

To use 8bit-adam during training, install the `bitsandbytes` package.

```shell
pip install bitsandbytes # optional, to use 8bit-adam
```

## Train GPT-NeoX-Chat-Base-20B

The `training/finetune_GPT-NeoXT-Chat-Base-20B.sh` script configures and runs the training loop. After downloading the dataset and the base model, run:

```shell
bash training/finetune_GPT-NeoXT-Chat-Base-20B.sh
```

The script launches 8 processes with a pipeline-parallel degree of 8 and a data-parallel degree of 1.

As the training loop runs, checkpoints are saved to the `model_ckpts` directory at the root of the repo.

Please see [the training README](training/README.md) for more details about customizing the training run.

The `training/finetune_Pythia-Chat-Base-7B.sh` script is another example to fine-tune a 7B pythia (gpt-neox) model. The script launches 8 processes with a pipeline-parallel degree of 4 and a data-parallel degree of 2.
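
In both configurations the number of launched processes equals the product of the two parallel degrees. The short sketch below spells out that arithmetic. The function name `world_size` is an illustrative assumption, not an identifier from the training scripts.

```python
def world_size(pipeline_parallel: int, data_parallel: int) -> int:
    """Number of processes needed for the given parallel degrees."""
    if pipeline_parallel < 1 or data_parallel < 1:
        raise ValueError("parallel degrees must be positive integers")
    return pipeline_parallel * data_parallel


# The two configurations described above both launch 8 processes.
print(world_size(8, 1))  # GPT-NeoXT-Chat-Base-20B script
print(world_size(4, 2))  # Pythia-Chat-Base-7B script
```

If you change one degree when customizing a script, adjust the other or the process count so that the product still holds.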

# Converting Weights to Huggingface Format

Before you can use this model to perform inference, it must be converted to the Huggingface format. Run this command from the root of the repo to do so.

```shell
mkdir huggingface_models \
  && python tools/convert_to_hf_gptneox.py \
       --ckpt-path model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_100  \
       --save-path huggingface_models/GPT-NeoXT-Chat-Base-20B  \
       --n-stages 8  \
       --n-layer-per-stage 6 \
       --fp16
```
where the `--fp16` flag will load and store models in fp16.

Make sure to replace `model_ckpts/GPT-Neo-XT-Chat-Base-20B/checkpoint_100` with the latest checkpoint in the `model_ckpts/GPT-Neo-XT-Chat-Base-20B` directory.

If you need to convert ckpts of other gpt-neox variants, make sure to specify the correct config name for your variant.
For example, if you want to convert a checkpoint fine-tuned from `EleutherAI/pythia-6.9b-deduped`, you should indicate this as a config name:
```shell
python tools/convert_to_hf_gptneox.py \
    --config-name EleutherAI/pythia-6.9b-deduped \
    --ckpt-path model_ckpts/Pythia-Chat-Base-7B/checkpoint_100 \
    --save-path huggingface_models/Pythia-Chat-Base-7B \
    --n-stages 4 \
    --n-layer-per-stage 8 \
    --fp16
```


# Inference

To help you test the model, we provide a simple command-line test harness to interact with the bot.

```shell
python inference/bot.py
```

By default the script will load the model named GPT-NeoXT-Chat-Base-20B under the `huggingface_models` directory, but you can override that behavior by specifying `--model`.

For example, if you want to load the base model from our Huggingface repo, you can run the following command, which downloads the weights from Huggingface.

```shell
python inference/bot.py --model togethercomputer/GPT-NeoXT-Chat-Base-20B
```

Once the model has loaded, enter text at the prompt and the model will reply.

```shell
$ python inference/bot.py 
Loading /home/csris/src/github.com/togethercomputer/OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:1...
Welcome to OpenChatKit shell.   Type /help or /? to list commands.

>>> Hello.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Hello human.

>>> 
```

Commands are prefixed with a `/`, and the `/quit` command exits.

Please see [the inference README](inference/README.md) for more details about arguments, running on multiple/specific GPUs, and running on consumer hardware.

# Monitoring

By default, the training script simply prints the loss as training proceeds, but it can also output metrics to a file using [loguru](https://github.com/Delgan/loguru) or report them to Weights & Biases.

## Loguru

Add the flag `--train-log-backend loguru` to your training script to log to `./logs/file_{time}.log`.

## Weights & Biases

To use Weights & Biases, first login with your Weights & Biases token.

```shell
wandb login
```

And set `--train-log-backend wandb` in the training script to enable logging to Weights & Biases.

# Experimental: Retrieval-Augmented Models

*Note: Retrieval is still experimental.*

The code in `/retrieval` implements a python package for querying a Faiss index of Wikipedia. The following steps explain how to use this index to augment queries in the test harness with context from the retriever.

1. Download the Wikipedia index.

```shell
python data/wikipedia-3sentence-level-retrieval-index/prepare.py
```

2. Run the bot with the `--retrieval` flag.

```shell
python inference/bot.py --retrieval
```

After starting, the bot will load both the chat model and the retrieval index, which takes a long time. Once the model and the index are loaded, all queries will be augmented with extra context.


```shell
$ python inference/bot.py --retrieval
Loading /OpenChatKit/inference/../huggingface_models/GPT-NeoXT-Chat-Base-20B to cuda:0...
Loading retrieval index...
Welcome to OpenChatKit shell.   Type /help or /? to list commands.

>>> Where is Zurich?
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Where is Zurich?
Zurich is located in Switzerland.

>>>
```

# Acknowledgements

Our model is a fine-tuned version of [gpt-neox-20b](https://huggingface.co/EleutherAI/gpt-neox-20b), a large language model trained by [Eleuther AI](https://www.eleuther.ai). We evaluated our model on [HELM](https://crfm.stanford.edu/helm/latest/) provided by the [Center for Research on Foundation Models](https://crfm.stanford.edu). And we collaborated with both [CRFM](https://crfm.stanford.edu) and [HazyResearch](http://hazyresearch.stanford.edu) at Stanford to build this model.

We collaborated with [LAION](https://laion.ai/) and [Ontocord.ai](https://www.ontocord.ai/) to build the training data used to fine tune this model.


================================================
FILE: docs/finetuning-RedPajama-3B.md
================================================
# RedPajama-3B

In this tutorial, you will learn how to fine-tune a base LLM on a sample of data. By the end of 
the tutorial, you will have fine-tuned the RedPajama-INCITE-Chat-3B model using a sample of 
chat data from the OIG dataset. You can adapt this tutorial to fine-tune with your own data.

In order to fine-tune the RedPajama 3B models, please follow these steps:

First clone the OpenChatKit repo:

```shell
git clone git@github.com:togethercomputer/OpenChatKit.git
```

Next install dependencies as instructed by the OpenChatKit repo.

# Prepare Weights

```shell
python pretrained/RedPajama-3B/prepare.py
```

This script will download the weights from HuggingFace and prepare them for finetuning. The prepared weights will be saved at

```
pretrained/RedPajama-3B/togethercomputer_RedPajama-INCITE-Chat-3B-v1
```

# Prepare Fine Tuning Data

We now need to prepare the training data. We provide an example script that downloads a small slice of data from OIG.
To download this sample dataset, please run:
 
```
bash data/OIG-chip2/prepare.sh
```
 
The sample dataset will be saved at 

```
data/OIG-chip2/unified_chip2.jsonl
```
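
To adapt the tutorial to your own data, you need a file in the same shape as the sample. The sketch below assumes that each line of the JSONL file is a JSON object with a `text` field holding `<human>:` and `<bot>:` turns. That layout is an assumption, so inspect the downloaded sample to confirm it before relying on it. The helper `write_chat_jsonl` is hypothetical.

```python
import json
from typing import Iterable, Tuple


def write_chat_jsonl(path: str, pairs: Iterable[Tuple[str, str]]) -> int:
    """Write (human, bot) pairs as one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for human, bot in pairs:
            # Assumed record layout: a single "text" field with both turns.
            record = {"text": f"<human>: {human}\n<bot>: {bot}"}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

For example, `write_chat_jsonl("my_data.jsonl", [("Hello.", "Hello human.")])` writes a one-line file. Point the dataset path in the training script at that file.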

# Run Fine Tuning Script

We provide an example training script.  Please configure the parameters (e.g., learning_rate, batch_size, dataset_path) according to your hardware configuration. 
Then to start training, simply run

```
bash training/finetune_RedPajama-INCITE-Chat-3B-v1.sh
```

# Convert to Huggingface Format

The fine-tuned model will be saved to 

```
model_ckpts/rp-incite-chat-3b-finetuned/checkpoint_{steps}
```

In order to use it for inference, you will need to convert it to the HuggingFace format. To do so, run the following script. The command is an example: change the checkpoint path, `n-stages`, and `n-layer-per-stage` to match your training script.

The default for n-stages used in the training script is 4 and the n-layer-per-stage is 8.

```
python tools/convert_to_hf_gptneox.py --config-name togethercomputer/RedPajama-INCITE-Chat-3B-v1 --ckpt-path model_ckpts/redpajama-incite-chat-3b-sample/checkpoint_10/ --save-path model_ckpts/hf --n-stages 4 --n-layer-per-stage 8
```

Then you are ready to go! You can load the model with HuggingFace and use it for inference, for example:

```python
import torch
import transformers
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("togethercomputer/RedPajama-INCITE-Chat-3B-v1")
model = AutoModelForCausalLM.from_pretrained("./model_ckpts/hf", torch_dtype=torch.float16)
model = model.to('cuda:0')

prompt = "<human>: Who is Alan Turing?\n<bot>:"
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
input_length = inputs.input_ids.shape[1]
outputs = model.generate(
    **inputs, max_new_tokens=128, do_sample=True, temperature=0.7, top_p=0.7, top_k=50, return_dict_in_generate=True
)
token = outputs.sequences[0, input_length:]
output_str = tokenizer.decode(token)
print(output_str)

```

Please note that the above finetuning takes around 60 GB of VRAM to fit everything onto the GPU, and may take even more to fit the training data. If you do not have such GPUs, we also provide low-rank finetuning scripts that work with 14 GB of VRAM. Here are the steps to get started.

* Clone the OpenChatKit repo, install dependencies and prepare the dataset. These steps are the same as full fine-tuning.

* The sample low-rank finetuning script is at /training/lora/redpajama-incite-chat-3b.py. Please modify this script to accommodate your own training data and preferred configuration.

* Then you can start low-rank finetuning by running this script.

Once the finetuning is finished, the resulting low-rank adapter will be saved to /outputs, and you can do inference with the following script.

```
python training/lora/redpajama-incite-chat-3b_inference.py
```

================================================
FILE: environment.yml
================================================
name: OpenChatKit
channels:
  - pytorch
  - nvidia
  - conda-forge
  - defaults
dependencies:
  - cudatoolkit=11.8.0
  - cupy=12.1.0
  - faiss-gpu=1.7.2
  - fastparquet=0.5.0
  - nccl=2.18.3.1
  - pip=23.2
  - pyarrow=12.0.1
  - python=3.10.9
  - python-snappy=0.6.1
  - pytorch=2.0.1
  - pytorch-cuda=11.8
  - snappy=1.1.9
  - torchaudio=2.0.2
  - torchvision=0.15.2
  - pip:
      - accelerate==0.21.0
      - boto3
      - datasets==2.13.1
      - loguru==0.6.0
      - netifaces==0.11.0
      - pandas==2.0.3
      - transformers==4.31.0
      - wandb==0.15.5
      - zstandard==0.21.0
      - sentencepiece


================================================
FILE: inference/README.md
================================================
# OpenChatKit Inference
This directory contains code for OpenChatKit's inference.

## Arguments
- `--gpu-id`: Primary GPU device to load inputs onto for inference. Default: `0`
- `--model`: name/path of the model. Default: `../huggingface_models/GPT-NeoXT-Chat-Base-20B`
- `--max-tokens`: the maximum number of tokens to generate. Default: `128`
- `--sample`: indicates whether to sample. Default: `True`
- `--temperature`: temperature for the LM. Default: `0.6`
- `--top-k`: top-k for the LM. Default: `40`
- `--retrieval`: augment queries with context from the retrieval index. Default: `False`
- `-g` `--gpu-vram`: GPU ID and VRAM to allocate to loading the model, separated by a `:` in the format `ID:RAM` where ID is the CUDA ID and RAM is in GiB. `gpu-id` must be present in this list to avoid errors. Accepts multiple values, for example, `-g ID_0:RAM_0 ID_1:RAM_1 ID_N:RAM_N`
- `-r` `--cpu-ram`: CPU RAM overflow allocation for loading the model. Optional, and only used if the model does not fit onto the GPUs given.

## Hardware requirements for inference
The GPT-NeoXT-Chat-Base-20B model requires at least 41GB of free VRAM. Used VRAM also goes up by ~100-200 MB per prompt. 

- A **minimum of 80 GB is recommended** 

- A **minimum of 48 GB in VRAM is recommended** for fast responses.

If you'd like to run inference on a GPU with <48 GB VRAM, refer to this section on [running on consumer hardware](#running-on-consumer-hardware).

By default, inference uses only CUDA Device 0.

**NOTE: Inference currently requires at least 1x GPU.**

## Running on multiple GPUs
Add the argument 

```-g ID0:MAX_VRAM ID1:MAX_VRAM ID2:MAX_VRAM ...``` 

where IDx is the CUDA ID of the device and MAX_VRAM is the amount of VRAM you'd like to allocate to the device.

For example, if you are running this on 4x 48 GB GPUs and want to distribute the model across all devices, add ```-g 0:10 1:12 2:12 3:12```. In this example, the first device is loaded to a max of 10 GiB while the others are loaded to a max of 12 GiB each.

How it works: The model fills up the max available VRAM on the first device passed and then overflows into the next until the whole model is loaded.

**IMPORTANT: This MAX_VRAM is only for loading the model. It does not account for the additional inputs that are added to the device. It is recommended to set the MAX_VRAM to be at least 1 or 2 GiB less than the max available VRAM on each device, and at least 3GiB less than the max available VRAM on the primary device (set by `gpu-id` default=0).**

**Decrease MAX_VRAM if you run into CUDA OOM. This happens because each input takes up additional space on the device.**

**NOTE: Total MAX_VRAM across all devices must be > size of the model in GB. If not, `bot.py` automatically offloads the rest of the model to RAM and disk. It will use up all available RAM. To allocate a specified amount of RAM: [refer to this section on running on consumer hardware](#running-on-consumer-hardware).**
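
The headroom guidance above can be turned into a small helper. This is a minimal sketch and not part of `bot.py`: the function name `suggest_gpu_vram_args` and its input (a mapping of CUDA ID to free VRAM in GiB, which the caller is assumed to have measured) are hypothetical, and the 3 GiB and 2 GiB margins are simply taken from the recommendation above.

```python
def suggest_gpu_vram_args(free_vram_gib, primary_id=0,
                          primary_headroom=3, other_headroom=2):
    """Build the values for `-g`, leaving headroom for inputs on each device.

    free_vram_gib: dict mapping CUDA ID -> free VRAM in GiB (hypothetical
    input; measure it yourself, for example with nvidia-smi).
    """
    if primary_id not in free_vram_gib:
        raise ValueError("the primary GPU must be one of the listed devices")
    args = []
    for gpu_id in sorted(free_vram_gib):
        headroom = primary_headroom if gpu_id == primary_id else other_headroom
        budget = int(free_vram_gib[gpu_id]) - headroom
        if budget <= 0:
            if gpu_id == primary_id:
                raise ValueError("not enough free VRAM on the primary GPU")
            continue  # too little room to be useful; leave this device out
        args.append(f"{gpu_id}:{budget}")
    return args


# Four hypothetical 48 GiB cards with device 0 as the primary device.
print("-g " + " ".join(suggest_gpu_vram_args({0: 48, 1: 48, 2: 48, 3: 48})))
# prints: -g 0:45 1:46 2:46 3:46
```

If you still hit CUDA OOM with these values, lower the budgets further, as described above.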

## Running on specific GPUs
If you have multiple GPUs but would only like to use a specific device(s), [use the same steps as in this section on running on multiple devices](#running-on-multiple-gpus) and only specify the devices you'd like to use. 

Also, if needed, add the argument `--gpu-id ID` where ID is the CUDA ID of the device you'd like to make the primary device. NOTE: The device specified in `--gpu-id` must be present as one of the IDs in the argument `-g` to avoid errors.

- **Example #1**: to run inference on devices 2 and 5 with a max of 25 GiB on each, and make device 5 the primary device, add: `--gpu-id 5 -g 2:25 5:25`. In this example, not adding `--gpu-id 5` will give you an error.
- **Example #2**: to run inference on devices 0 and 3 with a max of 10GiB on 0 and 40GiB on 3, with device 0 as the primary device, add: `-g 0:10 3:40`. In this example, `--gpu-id` is not required because device 0 is specified in `-g`.
- **Example #3**: to run inference only on device 1 with a max of 75 GiB, add: `--gpu-id 1 -g 1:75`
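
The rule that `--gpu-id` must appear in `-g` can be checked before any model is loaded. The sketch below is illustrative only: `build_max_memory` is a hypothetical helper, and the up-front validation is an addition for clarity. `bot.py` builds a similar dictionary from the same `ID:RAM` strings but does not perform this check itself.

```python
def build_max_memory(gpu_vram, gpu_id=0, cpu_ram=None):
    """Turn `-g ID:GiB ...` values into a max_memory-style dict."""
    max_memory = {}
    for entry in gpu_vram:
        device, _, vram = entry.partition(":")
        if not device.isdigit() or not vram.isdigit():
            raise ValueError(f"expected ID:GiB, got {entry!r}")
        max_memory[int(device)] = f"{vram}GiB"
    # The primary device has to be one of the devices the model is loaded on.
    if gpu_id not in max_memory:
        raise ValueError(f"--gpu-id {gpu_id} is not listed in -g")
    if cpu_ram is not None:
        max_memory["cpu"] = f"{int(cpu_ram)}GiB"
    return max_memory


# Example #1 from above: devices 2 and 5, with device 5 as the primary.
print(build_max_memory(["2:25", "5:25"], gpu_id=5))
# prints: {2: '25GiB', 5: '25GiB'}
```

Calling `build_max_memory(["2:25", "5:25"])` without `gpu_id=5` raises a `ValueError`, because the default primary device 0 is not in the list.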


## Running on consumer hardware
If you have multiple GPUs, each with <48 GB VRAM, [the steps in the section on running on multiple GPUs](#running-on-multiple-gpus) still apply, unless any of the following is true:
- You are running on just 1x GPU with <48 GB VRAM
- You have <48 GB VRAM combined across multiple GPUs
- You are running into Out-Of-Memory (OOM) issues

In any of these cases, add the flag `-r CPU_RAM`, where CPU_RAM is the maximum amount of RAM (in GiB) you'd like to allocate to loading the model. Note: This significantly reduces inference speed.

The model will load without `-r`, but this is not recommended because all available RAM will be allocated to the model. To limit how much RAM the model can use, add `-r`.

If the total VRAM + CPU_RAM < the size of the model in GiB, the rest of the model will be offloaded to a folder named "offload" in the current working directory. Note: This significantly reduces inference speed.

- Example: `-g 0:12 -r 20` will first load up to 12 GiB of the model into the CUDA device 0, then load up to 20 GiB into RAM, and load the rest into the "offload" directory.
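
The GPU, then RAM, then disk ordering in this example can be illustrated with simple arithmetic. This is a rough sketch under stated assumptions: `plan_offload` is a hypothetical function, the 41 GiB model size is only a stand-in based on the VRAM requirement quoted earlier, and the real placement is decided per module by Accelerate rather than per GiB.

```python
def plan_offload(model_gib, gpu_budgets_gib, cpu_ram_gib):
    """Return how many GiB land on each tier, filling GPU -> CPU -> disk."""
    remaining = model_gib
    plan = {}
    for gpu_id, budget in gpu_budgets_gib.items():
        used = min(remaining, budget)
        plan[f"cuda:{gpu_id}"] = used
        remaining -= used
    # Whatever does not fit on the GPUs goes to CPU RAM, up to the -r limit.
    plan["cpu"] = min(remaining, cpu_ram_gib)
    remaining -= plan["cpu"]
    # Anything still left over is written to the "offload" folder on disk.
    plan["disk"] = remaining
    return plan


# `-g 0:12 -r 20` with an assumed 41 GiB model.
print(plan_offload(41, {0: 12}, 20))
# prints: {'cuda:0': 12, 'cpu': 20, 'disk': 9}
```

The more of the model that ends up in the `cpu` and `disk` tiers, the slower inference becomes.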

How it works: 
- https://github.com/huggingface/blog/blob/main/accelerate-large-models.md
- https://www.youtube.com/embed/MWCSGj9jEAo


================================================
FILE: inference/bot.py
================================================
import os
import sys

INFERENCE_DIR = os.path.dirname(os.path.abspath(__file__))

# TODO: PYTHONPATH hacks are never a good idea. clean this up later
sys.path.append(os.path.join(INFERENCE_DIR, '..'))

import cmd
import torch
import argparse
import conversation as convo
import retrieval.wikipedia as wp
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, StoppingCriteria, StoppingCriteriaList
from accelerate import infer_auto_device_map, init_empty_weights


class StopWordsCriteria(StoppingCriteria):
    def __init__(self, tokenizer, stop_words, stream_callback):
        self._tokenizer = tokenizer
        self._stop_words = stop_words
        self._partial_result = ''
        self._stream_buffer = ''
        self._stream_callback = stream_callback

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        first = not self._partial_result
        text = self._tokenizer.decode(input_ids[0, -1])
        self._partial_result += text
        for stop_word in self._stop_words:
            if stop_word in self._partial_result:
                return True
        if self._stream_callback:
            if first:
                text = text.lstrip()
            # buffer tokens if the partial result ends with a prefix of a stop word, e.g. "<hu"
            for stop_word in self._stop_words:
                for i in range(1, len(stop_word)):
                    if self._partial_result.endswith(stop_word[0:i]):
                        self._stream_buffer += text
                        return False
            self._stream_callback(self._stream_buffer + text)
            self._stream_buffer = ''
        return False


class ChatModel:
    human_id = "<human>"
    bot_id = "<bot>"

    def __init__(self, model_name, gpu_id, max_memory):
        device = torch.device('cuda', gpu_id)   # TODO: allow sending to cpu

        # recommended default for devices with > 40 GB VRAM
        # load model onto one device
        if max_memory is None:
            self._model = AutoModelForCausalLM.from_pretrained(
                model_name, torch_dtype=torch.float16, device_map="auto")
            self._model.to(device)
        # load the model with the given max_memory config (for devices with insufficient VRAM or multi-gpu)
        else:
            config = AutoConfig.from_pretrained(model_name)
            # load empty weights
            with init_empty_weights():
                model_from_conf = AutoModelForCausalLM.from_config(config)

            model_from_conf.tie_weights()

            # create a device_map from max_memory
            device_map = infer_auto_device_map(
                model_from_conf,
                max_memory=max_memory,
                no_split_module_classes=["GPTNeoXLayer"],
                dtype="float16"
            )
            # load the model with the above device_map
            self._model = AutoModelForCausalLM.from_pretrained(
                model_name,
                device_map=device_map,
                offload_folder="offload",  # optional offload-to-disk overflow directory (auto-created)
                offload_state_dict=True,
                torch_dtype=torch.float16
            )
        self._tokenizer = AutoTokenizer.from_pretrained(model_name)

    def do_inference(self, prompt, max_new_tokens, do_sample, temperature, top_k, stream_callback=None):
        stop_criteria = StopWordsCriteria(self._tokenizer, [self.human_id], stream_callback)
        inputs = (
            self._tokenizer(prompt, return_tensors='pt')
            .to(self._model.device)
        )
        outputs = self._model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=do_sample,
            temperature=temperature,
            top_k=top_k,
            pad_token_id=self._tokenizer.eos_token_id,
            stopping_criteria=StoppingCriteriaList([stop_criteria]),
        )
        output = self._tokenizer.batch_decode(outputs)[0]

        # remove the context from the output
        output = output[len(prompt):]

        return output


class OpenChatKitShell(cmd.Cmd):
    intro = "Welcome to OpenChatKit shell.   Type /help or /? to list commands.\n"
    prompt = ">>> "

    def __init__(self, gpu_id, model_name_or_path, max_tokens, sample, temperature, top_k, retrieval, max_memory, do_stream):
        super().__init__()
        self._gpu_id = gpu_id
        self._model_name_or_path = model_name_or_path
        self._max_tokens = max_tokens
        self._sample = sample
        self._temperature = temperature
        self._top_k = top_k
        self._retrieval = retrieval
        self._max_memory = max_memory
        self._do_stream = do_stream

    def preloop(self):
        print(f"Loading {self._model_name_or_path} to cuda:{self._gpu_id}...")
        self._model = ChatModel(self._model_name_or_path, self._gpu_id, self._max_memory)

        if self._retrieval:
            print(f"Loading retrieval index...")
            self._index = wp.WikipediaIndex()

        self._convo = convo.Conversation(
            self._model.human_id, self._model.bot_id)

    def precmd(self, line):
        if line.startswith('/'):
            return line[1:]
        else:
            return 'say ' + line

    def do_say(self, arg):
        if self._retrieval:
            results = self._index.search(arg)
            if len(results) > 0:
                self._convo.push_context_turn(results[0])

        self._convo.push_human_turn(arg)

        output = self._model.do_inference(
            self._convo.get_raw_prompt(),
            self._max_tokens,
            self._sample,
            self._temperature,
            self._top_k,
            lambda x : print(x, end='', flush=True) if self._do_stream else None,
        )

        self._convo.push_model_response(output)

        print("" if self._do_stream else self._convo.get_last_turn())

    def do_raw_say(self, arg):
        output = self._model.do_inference(
            arg,
            self._max_tokens,
            self._sample,
            self._temperature,
            self._top_k
        )

        print(output)

    def do_raw_prompt(self, arg):
        print(self._convo.get_raw_prompt())

    def do_reset(self, arg):
        self._convo = convo.Conversation(
            self._model.human_id, self._model.bot_id)

    def do_hyperparameters(self, arg):
        print(
            f"Hyperparameters:\n"
            f"  max_tokens: {self._max_tokens}\n"
            f"  sample: {self._sample}\n"
            f"  temperature: {self._temperature}\n"
            f"  top_k: {self._top_k}"
        )

    def do_quit(self, arg):
        return True


def main():
    parser = argparse.ArgumentParser(
        description='test harness for OpenChatKit')

    parser.add_argument(
        '--gpu-id',
        default=0,
        type=int,
        help='the ID of the GPU to run on'
    )
    parser.add_argument(
        '--model',
        default=f"{INFERENCE_DIR}/../huggingface_models/Pythia-Chat-Base-7B",
        help='name/path of the model'
    )
    parser.add_argument(
        '--max-tokens',
        default=128,
        type=int,
        help='the maximum number of tokens to generate'
    )
    parser.add_argument(
        '--sample',
        default=True,
        action='store_true',
        help='indicates whether to sample'
    )
    parser.add_argument(
        '--no-stream',
        action='store_true',
        help='disable streaming of tokens as they are generated'
    )
    parser.add_argument(
        '--temperature',
        default=0.6,
        type=float,
        help='temperature for the LM'
    )
    parser.add_argument(
        '--top-k',
        default=40,
        type=int,
        help='top-k for the LM'
    )
    parser.add_argument(
        '--retrieval',
        default=False,
        action='store_true',
        help='augment queries with context from the retrieval index'
    )
    parser.add_argument(
        '-g',
        '--gpu-vram',
        action='store',
        help='max VRAM to allocate per GPU',
        nargs='+',
        required=False,
    )
    parser.add_argument(
        '-r',
        '--cpu-ram',
        default=None,
        type=int,
        help='max CPU RAM to allocate',
        required=False
    )
    args = parser.parse_args()

    # set max_memory dictionary if given
    if args.gpu_vram is None:
        max_memory = None
    else:
        max_memory = {}
        for i in range(len(args.gpu_vram)):
            # assign CUDA ID as label and XGiB as value
            max_memory[int(args.gpu_vram[i].split(':')[0])] = f"{args.gpu_vram[i].split(':')[1]}GiB"

        if args.cpu_ram is not None:
            # add cpu to max-memory if given
            max_memory['cpu'] = f"{int(args.cpu_ram)}GiB"

    OpenChatKitShell(
        args.gpu_id,
        args.model,
        args.max_tokens,
        args.sample,
        args.temperature,
        args.top_k,
        args.retrieval,
        max_memory,
        not args.no_stream,
    ).cmdloop()


if __name__ == '__main__':
    main()


================================================
FILE: inference/conversation.py
================================================
import re
import time

MEANINGLESS_WORDS = ['<pad>', '</s>', '<|endoftext|>']
PRE_PROMPT = """\
Current Date: {}
Current Time: {}

"""

def clean_response(response):
    for word in MEANINGLESS_WORDS:
        response = response.replace(word, "")
    response = response.strip("\n")
    return response

class Conversation:
    def __init__(self, human_id, bot_id):
        cur_date = time.strftime('%Y-%m-%d')
        cur_time = time.strftime('%H:%M:%S %p %Z')

        self._human_id = human_id
        self._bot_id = bot_id
        self._prompt = PRE_PROMPT.format(cur_date, cur_time)

    def push_context_turn(self, context):
        # for now, context is represented as a human turn
        self._prompt += f"{self._human_id}: {context}\n"

    def push_human_turn(self, query):
        self._prompt += f"{self._human_id}: {query}\n"
        self._prompt += f"{self._bot_id}:"

    def push_model_response(self, response):
        has_finished = self._human_id in response
        bot_turn = response.split(f"{self._human_id}:")[0]
        bot_turn = clean_response(bot_turn)
        # if it is truncated, then append "..." to the end of the response
        if not has_finished:
            bot_turn += "..."

        self._prompt += f"{bot_turn}\n"

    def get_last_turn(self):
        human_tag = f"{self._human_id}:"
        bot_tag = f"{self._bot_id}:"
        turns = re.split(rf"({human_tag}|{bot_tag})\W?", self._prompt)
        return turns[-1]

    def get_raw_prompt(self):
        return self._prompt

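    @classmethod
    def from_raw_prompt(cls, value, human_id="<human>", bot_id="<bot>"):
        # The default IDs are an assumption that mirrors ChatModel's tags.
        convo = cls(human_id, bot_id)
        convo._prompt = value
        return convo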


================================================
FILE: pretrained/GPT-NeoX-20B/prepare.py
================================================
import sys
import os

# Import the prepare_pretrained function
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(current_dir, '..'))
from prepare_pretrained import prepare_pretrained

if __name__ == "__main__":
    model_name = "EleutherAI/gpt-neox-20b"
    save_path = os.path.join(current_dir, model_name.replace('/', '_'))
    prepare_pretrained(save_path, model_name)


================================================
FILE: pretrained/Llama-2-7B-32K-beta/prepare.py
================================================
import os
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

DIR = os.path.dirname(os.path.abspath(__file__))
USE_AUTH_TOKEN = False

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert HF checkpoints')
    parser.add_argument('--model-name', type=str, default='togethercomputer/Llama-2-7B-32K-beta', 
                        help='model-name')
    parser.add_argument('--save-dir', type=str, default=DIR, 
                        help='directory to save the converted checkpoint to')
    parser.add_argument('--offload-dir', type=str, default=None,
                        help='directory to offload from memory')
    args = parser.parse_args()
    
    if not os.path.exists(args.save_dir):
        os.mkdir(args.save_dir)
    save_path = os.path.join(args.save_dir, args.model_name.replace('/', '_'))
    if not os.path.exists(save_path):
        os.mkdir(save_path)
    
    print('loading model from HF...')
    config = AutoConfig.from_pretrained(args.model_name, use_auth_token=USE_AUTH_TOKEN)
    config.save_pretrained(save_path)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_auth_token=USE_AUTH_TOKEN)
    tokenizer.save_pretrained(save_path)

    # offload model from memory to disk if offload-dir is specified
    if args.offload_dir is not None:
        if not os.path.exists(args.offload_dir):
            os.mkdir(args.offload_dir)
        model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder=args.offload_dir, use_auth_token=USE_AUTH_TOKEN)
    else:
        model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, use_auth_token=USE_AUTH_TOKEN)
    print('loaded model from HF...')
    
    print('converting the embedding layer...')
    item = {}
    item['embed_tokens.weight'] = model.model.embed_tokens.weight
    torch.save(item, os.path.join(save_path, 'pytorch_embs.pt'))
    print('converted the embedding layer.')

    for i in range(len(model.model.layers)):
        print(f'converting the {i}-th transformer layer...')
        torch.save(model.model.layers[i].state_dict(), os.path.join(save_path, f'pytorch_{i}.pt'))
        print(f'converted the {i}-th transformer layer.')

    print('converting the lm_head layer...')
    item = {}
    item['lm_head.weight'] = model.lm_head.weight
    item['norm.weight'] = model.model.norm.weight
    torch.save(item, os.path.join(save_path, 'pytorch_lm_head.pt'))
    print('converted the lm_head layer.')


================================================
FILE: pretrained/Pythia-6.9B-deduped/prepare.py
================================================
import sys
import os

# Import the prepare_pretrained function
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(current_dir, '..'))
from prepare_pretrained import prepare_pretrained

if __name__ == "__main__":
    model_name = "EleutherAI/pythia-6.9b-deduped"
    save_path = os.path.join(current_dir, model_name.replace('/', '_'))
    prepare_pretrained(save_path, model_name)


================================================
FILE: pretrained/RedPajama-3B/prepare.py
================================================
import os
import sys

# Import the prepare_pretrained function
current_dir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.join(current_dir, '..'))
from prepare_pretrained import prepare_pretrained

if __name__ == "__main__":
    model_name = "togethercomputer/RedPajama-INCITE-Chat-3B-v1"
    save_path = os.path.join(current_dir, model_name.replace('/', '_'))
    prepare_pretrained(save_path, model_name)


================================================
FILE: pretrained/RedPajama-7B/prepare.py
================================================
import os
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

DIR = os.path.dirname(os.path.abspath(__file__))
USE_AUTH_TOKEN = False

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Convert HF checkpoints')
    parser.add_argument('--model-name', type=str, default='togethercomputer/RedPajama-INCITE-7B-Chat', 
                        help='model-name')
    parser.add_argument('--save-dir', type=str, default=DIR, 
                        help='directory to save the converted checkpoint to')
    parser.add_argument('--offload-dir', type=str, default=None,
                        help='directory to offload from memory')
    args = parser.parse_args()
    
    if not os.path.exists(args.save_dir):
        os.mkdir(args.save_dir)
    save_path = os.path.join(args.save_dir, args.model_name.replace('/', '_'))
    if not os.path.exists(save_path):
        os.mkdir(save_path)
    
    print('loading model from HF...')
    config = AutoConfig.from_pretrained(args.model_name, use_auth_token=USE_AUTH_TOKEN)
    config.save_pretrained(save_path)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name, use_auth_token=USE_AUTH_TOKEN)
    tokenizer.save_pretrained(save_path)

    # offload model from memory to disk if offload-dir is specified
    if args.offload_dir is not None:
        if not os.path.exists(args.offload_dir):
            os.mkdir(args.offload_dir)
        model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, device_map="auto", offload_folder=args.offload_dir, use_auth_token=USE_AUTH_TOKEN)
    else:
        model = AutoModelForCausalLM.from_pretrained(args.model_name, torch_dtype=torch.float16, use_auth_token=USE_AUTH_TOKEN)
    print('loaded model from HF...')
    
    print('converting the embedding layer...')
    
    item = {}
    item['embed_in.weight'] = model.gpt_neox.embed_in.weight
    torch.save(item, os.path.join(save_path, 'pytorch_embs.pt'))
    print('converted the embedding layer.')

    for i in range(len(model.gpt_neox.layers)):
        print(f'converting the {i}-th transformer layer...')
        torch.save(model.gpt_neox.layers[i].state_dict(), os.path.join(save_path, f'pytorch_{i}.pt'))
        print(f'converted the {i}-th transformer layer.')

    print('converting the lm_head layer...')
    item = {}
    item['embed_out.weight'] = model.embed_out.weight
    item['final_layer_norm.weight'] = model.gpt_neox.final_layer_norm.weight
    item['final_layer_norm.bias'] = model.gpt_neox.final_layer_norm.bias
    torch.save(item, os.path.join(save_path, 'pytorch_lm_head.pt'))
    print('converted the lm_head layer.')


================================================
FILE: pretrained/prepare_pretrained.py
================================================
import os
import argparse
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig

DIR = os.path.dirname(os.path.abspath(__file__))
USE_AUTH_TOKEN = False

# Load pretrained model from HuggingFace and save it to disk
def prepare_pretrained(save_path, model_name, offload_dir=None):
    os.makedirs(save_path, exist_ok=True)
    
    print('loading model from HF...')
    config = AutoConfig.from_pretrained(model_name, use_auth_token=USE_AUTH_TOKEN)
    config.save_pretrained(save_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=USE_AUTH_TOKEN)
    tokenizer.save_pretrained(save_path)

    # offload model from memory to disk if offload-dir is specified
    if offload_dir is not None:
        os.makedirs(offload_dir, exist_ok=True)
        model = AutoModelForCausalLM.from_pretrained(model_name, 
                                                     torch_dtype=torch.float16,
                                                     device_map="auto",
                                                     offload_folder=offload_dir,
                                                     use_auth_token=USE_AUTH_TOKEN)
    else:
        model = AutoModelForCausalLM.from_pretrained(model_name,
                                                     torch_dtype=torch.float16,
                                                     use_auth_token=USE_AUTH_TOKEN)
    print('loaded model from HF...')
    
    print('converting the embedding layer...')
    item = {}
    item['embed_in.weight'] = model.gpt_neox.embed_in.weight
    torch.save(item, os.path.join(save_path, 'pytorch_embs.pt'))
    print('converted the embedding layer.')

    for i in range(len(model.gpt_neox.layers)):
        print(f'converting the {i}-th transformer layer...')
        torch.save(model.gpt_neox.layers[i].state_dict(), os.path.join(save_path, f'pytorch_{i}.pt'))
        print(f'converted the {i}-th transformer layer.')

    print('converting the lm_head layer...')
    item = {}
    item['embed_out.weight'] = model.embed_out.weight
    item['final_layer_norm.weight'] = model.gpt_neox.final_layer_norm.weight
    item['final_layer_norm.bias'] = model.gpt_neox.final_layer_norm.bias
    torch.save(item, os.path.join(save_path, 'pytorch_lm_head.pt'))
    print('converted the lm_head layer.')

# python pretrained/prepare_pretrained.py --model-name EleutherAI/gpt-neox-125M --save-dir pretrained/files --offload-dir pretrained/files/offload
def main():
    parser = argparse.ArgumentParser(description='Convert HF checkpoints')
    parser.add_argument('--model-name', type=str, required=True, 
                        help='model-name')
    parser.add_argument('--save-dir', type=str, required=True, 
                        help='directory to save the converted checkpoint to')
    parser.add_argument('--offload-dir', type=str, default=None,
                        help='directory to offload from memory')
    args = parser.parse_args()
    
    prepare_pretrained(args.save_dir, args.model_name, args.offload_dir)

if __name__ == '__main__':
    main()

================================================
FILE: retrieval/README.md
================================================
# Retrieval-Enhanced Chatbot

This is a demonstration of how to enhance a chatbot using Wikipedia. We'll be using [ChristophSchuhmann/wikipedia-3sentence-level-retrieval-index](https://huggingface.co/datasets/ChristophSchuhmann/wikipedia-3sentence-level-retrieval-index) for this demo. Thanks to Christoph for providing this resource!

In this demo, we extend the basic lookup: up to `w` adjacent sentences on each side of the matched sentence are compared with it and added to the result if their cosine similarity to the match is larger than `w_th`. This gives the chatbot a longer context, which may improve its performance.

This demo combines both the above index and the chat model into one system.

## Start the combined server

To get started, we need to install some dependencies and download the Wikipedia index:

0. Install dependencies

Install the necessary dependencies, including `torch`, `transformers`, `flask`, `faiss`, and `fastparquet`.

1. Open `wiki-server.py` and set `model_name_or_path` to the path that contains the chat model.


2. Start the retrieval server

```shell
python wiki-server.py
```

The server will listen on port 7003. It will download the datasets from ChristophSchuhmann. This
may take a few minutes.

3. Test the full retrieval enhanced chatbot

We now demonstrate both the wiki index and the GPT-NeoX-fine-tuned model.

```curl -X POST -H 'Content-Type: application/json' http://127.0.0.1:7003/inference -d '{ "prompt" : "where is zurich located?" }'```

Internally, we first query the wiki index and then generate a response using the provided model. To do
this, we concatenate the retrieved information and the user's query into a prompt,
encode it with a tokenizer, and generate a response using the chatbot model.

The response should indicate the location of Zurich city.


4. To test just the retrieval functionality of the system, you can do the following. Curl works
as well.

```python
import requests

endpoint = 'http://127.0.0.1:7003/search'
res = requests.post(endpoint, json={
    'query': 'Where is Zurich?',
    'k': 1,
    'w': 5,
    'w_th': 0.7,
})
print(res.json())
```

This should print the most relevant sentences about Zurich from Wikipedia. By increasing w and 
decreasing w_th, we can retrieve a longer context.
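
The effect of `w` and `w_th` can be sketched in a few lines of plain Python. This is a simplified illustration and not the code in `retrieval/wikipedia.py`: `expand_context` and `cosine` are hypothetical names, the toy 2-D vectors stand in for real Contriever embeddings, and the explicit bounds checks are an assumption made to keep the example self-contained.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def expand_context(snippets, embeddings, hit, w=5, w_th=0.5):
    """Grow the matched snippet with up to `w` similar neighbours per side."""
    text = snippets[hit]
    # Walk backwards from the hit and stop at the first dissimilar neighbour.
    for j in range(1, w + 1):
        if hit - j < 0 or cosine(embeddings[hit - j], embeddings[hit]) <= w_th:
            break
        text = snippets[hit - j] + text
    # Walk forwards from the hit in the same way.
    for j in range(1, w + 1):
        if hit + j >= len(snippets) or cosine(embeddings[hit + j], embeddings[hit]) <= w_th:
            break
        text += snippets[hit + j]
    return text


snippets = ["A. ", "B. ", "C. ", "D. "]
embeddings = [[0.0, 1.0], [1.0, 0.1], [1.0, 0.0], [0.9, 0.2]]
print(expand_context(snippets, embeddings, hit=2, w=5, w_th=0.7))
# prints: B. C. D.
```

Here the neighbours "B. " and "D. " are similar enough to the match "C. " to be included, while "A. " falls below the threshold and ends the backward walk.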




================================================
FILE: retrieval/__init__.py
================================================


================================================
FILE: retrieval/wikipedia.py
================================================
# This file was adapted from ChristophSchuhmann/wikipedia-3sentence-level-retrieval-index:
#   https://huggingface.co/datasets/ChristophSchuhmann/wikipedia-3sentence-level-retrieval-index/blob/main/wikiindexquery.py
#
# The original file was licensed under the Apache 2.0 license.

import os

from transformers import AutoTokenizer, AutoModel
import faiss
import numpy as np
import pandas as pd

DIR = os.path.dirname(os.path.abspath(__file__))


def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

def cos_sim_2d(x, y):
    norm_x = x / np.linalg.norm(x, axis=1, keepdims=True)
    norm_y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return np.matmul(norm_x, norm_y.T)


class WikipediaIndex:
    def __init__(self):
        path = os.path.join(DIR, '..', 'data', 'wikipedia-3sentence-level-retrieval-index', 'files')
        indexpath = os.path.join(path, 'knn.index')
        wiki_sentence_path = os.path.join(path, 'wikipedia-en-sentences.parquet')

        self._device = 'cuda'
        self._tokenizer = AutoTokenizer.from_pretrained('facebook/contriever-msmarco')
        self._contriever = AutoModel.from_pretrained('facebook/contriever-msmarco').to(self._device)

        self._df_sentences = pd.read_parquet(wiki_sentence_path, engine='fastparquet')

        self._wiki_index = faiss.read_index(indexpath, faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)


    def search(self, query, k=1, w=5, w_th=0.5):
        inputs = self._tokenizer(query, padding=True, truncation=True, return_tensors='pt').to(self._device)
        outputs = self._contriever(**inputs)
        embeddings = mean_pooling(outputs[0], inputs['attention_mask'])
        
        query_vector = embeddings.cpu().detach().numpy().reshape(1, -1)
        
        distances, indices = self._wiki_index.search(query_vector, k)
        
        texts = []
        for i, (dist, indice) in enumerate(zip(distances[0], indices[0])):
            text = self._df_sentences.iloc[indice]['text_snippet']

            try:
                input_texts = [self._df_sentences.iloc[indice]['text_snippet']]
                for j in range(1, w+1):
                    input_texts = [self._df_sentences.iloc[indice-j]['text_snippet']] + input_texts
                for j in range(1, w+1):
                    input_texts = input_texts + [self._df_sentences.iloc[indice+j]['text_snippet']]
                
                inputs = self._tokenizer(input_texts, padding=True, truncation=True, return_tensors='pt').to(self._device)

                outputs = self._contriever(**inputs)
                embeddings = mean_pooling(outputs[0], inputs['attention_mask']).detach().cpu().numpy()

                for j in range(1, w+1):
                    if cos_sim_2d(embeddings[w-j].reshape(1, -1), embeddings[w].reshape(1, -1)) > w_th:
                        text = self._df_sentences.iloc[indice-j]['text_snippet'] + text
                    else:
                        break

                for j in range(1, w+1):
                    if cos_sim_2d(embeddings[w+j].reshape(1, -1), embeddings[w].reshape(1, -1)) > w_th:
                        text += self._df_sentences.iloc[indice+j]['text_snippet']
                    else:
                        break

            except Exception as e:
                print(e)

            texts.append(text)
        
        return texts


================================================
FILE: tools/README.md
================================================
# OpenChatKit Tools

## convert_to_hf_gptneox.py

## model_load_benchmark.py

The command to run the model load benchmark tool is:
```shell
$ python3 model_load_benchmark.py -i benchmark_input.json -o benchmark_results.json -d cuda:0
```

```
usage: model_load_benchmark.py [-h] -i INPUT -o OUTPUT [-d DEVICE] [-r REPEAT_INFER]

Benchmark downloading, loading, and running an inference for a set of ML models.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input JSON file containing models to be benchmarked
  -o OUTPUT, --output OUTPUT
                        Output JSON file with model benchmark results
  -d DEVICE, --device DEVICE
                        Cuda device name, e.g. "cuda:0"
  -r REPEAT_INFER, --repeat-infer REPEAT_INFER
                        Repeat inference for warm timings
```

The input file is a JSON file with the names and paths of the models to be tested. For example:
```JSON
{
    "GPT-NeoXT-Chat-Base-20B": "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    "Pythia-Chat-Base-7B": "togethercomputer/Pythia-Chat-Base-7B",
    "GPT-JT-Moderation-6B": "togethercomputer/GPT-JT-Moderation-6B",
    "GPT-JT-6B-v1": "togethercomputer/GPT-JT-6B-v1",
    "GPT-JT-6B-v0": "togethercomputer/GPT-JT-6B-v0"
}
```

The output is a JSON file with the following measurements for each model:
1. tokenizer download time in seconds -- `tokenizer_download_sec`
2. tokenizer load time in seconds -- `tokenizer_load_sec`
3. model download time in seconds -- `model_download_sec`
4. model load to RAM time in seconds -- `model_load_to_ram_sec`
5. model transfer to GPU time in seconds -- `model_transfer_to_gpu_sec`
6. inference time in seconds (input is "hello, world!") -- `inference_sec`
7. total time in seconds (sum of all the above) -- `total_sec`
8. inference time from a warm start (the average of running inference `REPEAT_INFER` times) -- `inference_warm_sec`
9. model main memory footprint in MB -- `model_main_memory_MB`
10. model GPU memory footprint in MB -- `model_gpu_memory_MB`

An example of the output is:
```JSON
{
    "GPT-JT-6B-v1": {
        "tokenizer_download_sec": 1.52,
        "tokenizer_load_sec": 0.10,
        "model_download_sec": 124.70,
        "model_load_to_ram_sec": 127.81,
        "model_main_memory_MB": 12297.10,
        "model_transfer_to_gpu_sec": 3.29,
        "model_gpu_memory_MB": 12219.74,
        "inference_sec": 0.93,
        "inference_warm_sec": 0.047,
        "total_sec": 258.38
    }
}
```
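
As a quick sanity check on a results file, the sketch below re-adds the six per-stage timings and compares the sum with `total_sec`. The helper name `check_total` and the tolerance value are illustrative assumptions, not part of the tool. A small tolerance is needed because the values in the file are shortened to a couple of decimal places.

```python
import json

# Keys that the list above describes as contributing to `total_sec`.
STAGE_KEYS = [
    "tokenizer_download_sec",
    "tokenizer_load_sec",
    "model_download_sec",
    "model_load_to_ram_sec",
    "model_transfer_to_gpu_sec",
    "inference_sec",
]


def check_total(results: dict, tolerance: float = 0.1) -> dict:
    """Return {model_name: (sum_of_stages, reported_total, within_tolerance)}."""
    report = {}
    for name, timings in results.items():
        stage_sum = sum(timings[key] for key in STAGE_KEYS)
        total = timings["total_sec"]
        report[name] = (stage_sum, total, abs(stage_sum - total) <= tolerance)
    return report


# The example output from above, inlined so the sketch runs without a GPU.
example = json.loads("""
{
    "GPT-JT-6B-v1": {
        "tokenizer_download_sec": 1.52,
        "tokenizer_load_sec": 0.10,
        "model_download_sec": 124.70,
        "model_load_to_ram_sec": 127.81,
        "model_main_memory_MB": 12297.10,
        "model_transfer_to_gpu_sec": 3.29,
        "model_gpu_memory_MB": 12219.74,
        "inference_sec": 0.93,
        "inference_warm_sec": 0.047,
        "total_sec": 258.38
    }
}
""")

report = check_total(example)
for name, (stage_sum, total, ok) in report.items():
    print(f"{name}: stages={stage_sum:.2f} total={total:.2f} consistent={ok}")
```

To check a real run, load `benchmark_results.json` with `json.load` and pass the resulting dictionary to `check_total` instead of `example`.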

================================================
FILE: tools/benchmark_input.json
================================================
{
    "GPT-NeoXT-Chat-Base-20B": "togethercomputer/GPT-NeoXT-Chat-Base-20B",
    "Pythia-Chat-Base-7B": "togethercomputer/Pythia-Chat-Base-7B",
    "GPT-JT-Moderation-6B": "togethercomputer/GPT-JT-Moderation-6B",
    "GPT-JT-6B-v1": "togethercomputer/GPT-JT-6B-v1",
    "GPT-JT-6B-v0": "togethercomputer/GPT-JT-6B-v0"
}

================================================
FILE: tools/convert_to_hf_gptneox.py
================================================
import torch
import torch.nn as nn

import argparse

from transformers import GPTNeoXForCausalLM

from transformers import AutoConfig, AutoTokenizer

from transformers.modeling_utils import no_init_weights
import os


def create_empty_gptneox(config):

    import torch
    import torch.nn as nn

    _reset_parameters_linear = nn.Linear.reset_parameters
    def dummy(*args, **kargs):
        pass
    nn.Linear.reset_parameters = dummy

    # 1. Disable weight initialization to make model creation faster.
    # 2. Avoid tying the token embeddings to lm_head, as we train them separately.
    with no_init_weights(_enable=True):
        model = GPTNeoXForCausalLM(config).eval()

    nn.Linear.reset_parameters = _reset_parameters_linear

    return model

def load_decentralized_checkpoint(model, checkpoint_path, n_stages=2, n_layer_per_stage=14):
    input_path = checkpoint_path

    assert n_stages * n_layer_per_stage >= len(model.gpt_neox.layers)
    # assert model.lm_head.weight.data is not model.transformer.wte.weight.data

    for i in range(n_stages):

        print(f'loading stage {i}')

        checkpoint = torch.load(os.path.join(input_path, f'prank_{i}_checkpoint.pt'), map_location=torch.device("cpu"))

        if i == 0:
            _tmp = {k[len(f"{0}."):]:v for k,v in checkpoint.items() if k.startswith(f"0.")}
            # torch.save(_tmp, os.path.join(output_path, f'pytorch_embs.pt'))
            model.gpt_neox.embed_in.weight.data[:] = _tmp['embed_in.weight']

            for j in range(n_layer_per_stage):
                _tmp = {k[len(f"{j+1}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j+1}.")}
                if len(_tmp) == 0:
                    break
                # torch.save(_tmp, os.path.join(output_path, f'pytorch_{j}.pt'))
                model.gpt_neox.layers[j].load_state_dict(_tmp)

        elif i == n_stages - 1:
            for j in range(n_layer_per_stage):
                _tmp = {k[len(f"{j}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j}.")}
                if len(_tmp) == 0:
                    break
                # torch.save(_tmp, os.path.join(output_path, f'pytorch_{i*n_layer_per_stage + j}.pt'))
                model.gpt_neox.layers[i*n_layer_per_stage + j].load_state_dict(_tmp)
                if i*n_layer_per_stage + j == len(model.gpt_neox.layers) - 1:
                    j += 1
                    break

            _tmp = {k[len(f"{j}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j}.")}
            if len(_tmp) == 0:
                break
            # torch.save(_tmp, os.path.join(output_path, f'pytorch_lm_head.pt'))
            model.gpt_neox.final_layer_norm.weight.data[:] = _tmp['final_layer_norm.weight']
            model.gpt_neox.final_layer_norm.bias.data[:] = _tmp['final_layer_norm.bias']
            model.embed_out.weight.data[:] = _tmp['embed_out.weight']
            if 'embed_out.bias' in _tmp:
                model.embed_out.bias.data[:] = _tmp['embed_out.bias']

        else:
            for j in range(n_layer_per_stage):
                _tmp = {k[len(f"{j}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j}.")}
                if len(_tmp) == 0:
                    break
                # torch.save(_tmp, os.path.join(output_path, f'pytorch_{i*n_layer_per_stage + j}.pt'))
                model.gpt_neox.layers[i*n_layer_per_stage + j].load_state_dict(_tmp)

    return model


if __name__ == '__main__':
    
    parser = argparse.ArgumentParser(description='Convert a pipeline-sharded checkpoint to Hugging Face format')
    parser.add_argument('--config-name', type=str, default='EleutherAI/gpt-neox-20b',
                        help='HF model name or path used to load the config and tokenizer')
    parser.add_argument('--ckpt-path', type=str, default=None,
                        help='directory containing the prank_{i}_checkpoint.pt files')
    parser.add_argument('--save-path', type=str, default=None,
                        help='output directory for the converted HF model')
    parser.add_argument('--n-stages', type=int, default=8,
                        help='pipeline group size')
    parser.add_argument('--n-layer-per-stage', type=int, default=6,
                        help='number of layers per pipeline stage (GPU device)')
    parser.add_argument('--fp16', default=False, action='store_true')
    args = parser.parse_args()
    
    assert args.ckpt_path is not None
    assert args.save_path is not None
    
    os.makedirs(args.save_path, exist_ok=True)

    print('loading config...')
    config = AutoConfig.from_pretrained(args.config_name)
    print('loaded config.')
    print('loading tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(args.config_name)
    print('loaded tokenizer.')
    print('creating empty model...')
    model = create_empty_gptneox(config)
    if args.fp16:
        model = model.half()
    print('created empty model.')
    print('loading model ckpt...')
    load_decentralized_checkpoint(
        model, args.ckpt_path, n_stages=args.n_stages, n_layer_per_stage=args.n_layer_per_stage,
    )
    print('loaded model ckpt.')
    
    print('saving HF model...')
    model.save_pretrained(args.save_path)
    print(f'saved HF model to `{args.save_path}`')
    config.save_pretrained(args.save_path)
    tokenizer.save_pretrained(args.save_path)
    


================================================
FILE: tools/convert_to_hf_llama.py
================================================
import os
import argparse

import torch
import torch.nn as nn

from transformers import LlamaForCausalLM
from transformers import AutoConfig, AutoTokenizer

from transformers.modeling_utils import no_init_weights


def create_emtpy_llama(config):

    import torch
    import torch.nn as nn

    _reset_parameters_linear = nn.Linear.reset_parameters
    def dummy(*args, **kargs):
        pass
    nn.Linear.reset_parameters = dummy

    # 1. Disable weight initialization to make model creation faster.
    # 2. Avoid tying the token embeddings to lm_head, as we train them separately.
    with no_init_weights(_enable=True):
        model = LlamaForCausalLM(config).eval()

    nn.Linear.reset_parameters = _reset_parameters_linear

    return model

def load_decentralized_checkpoint(model, checkpoint_path, n_stages=2, n_layer_per_stage=16, ):
    input_path = checkpoint_path

    n_layers = len(model.model.layers)
    assert n_stages * n_layer_per_stage >= len(model.model.layers)
    # assert model.lm_head.weight.data is not model.transformer.wte.weight.data

    for i in range(n_stages):

        print(f'loading stage {i}')

        checkpoint = torch.load(os.path.join(input_path, f'prank_{i}_checkpoint.pt'), map_location=torch.device("cpu"))

        if i == 0:
            _tmp = {k[len(f"{0}."):]:v for k,v in checkpoint.items() if k.startswith(f"0.")}
            # torch.save(_tmp, os.path.join(output_path, f'pytorch_embs.pt'))
            model.model.embed_tokens.weight.data[:] = _tmp['embed_tokens.weight']

            for j in range(n_layer_per_stage):
                _tmp = {k[len(f"{j+1}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j+1}.")}
                if len(_tmp) == 0:
                    break
                # torch.save(_tmp, os.path.join(output_path, f'pytorch_{j}.pt'))
                ret = model.model.layers[j].load_state_dict(_tmp, strict=False)
                if len(ret.missing_keys):
                    print('The following weight keys are missing:')
                    print(ret.missing_keys)
                if len(ret.unexpected_keys):
                    print('The following weight keys are unexpected:')
                    print(ret.unexpected_keys)

        elif i == n_stages - 1:
            for j in range(n_layer_per_stage):
                if i*n_layer_per_stage + j == n_layers:
                    break
                _tmp = {k[len(f"{j}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j}.")}
                if len(_tmp) == 0:
                    break
                # torch.save(_tmp, os.path.join(output_path, f'pytorch_{i*n_layer_per_stage + j}.pt'))
                ret = model.model.layers[i*n_layer_per_stage + j].load_state_dict(_tmp, strict=False)
                if len(ret.missing_keys):
                    print('The following weight keys are missing:')
                    print(ret.missing_keys)
                if len(ret.unexpected_keys):
                    print('The following weight keys are unexpected:')
                    print(ret.unexpected_keys)
            else:
                j += 1

            _tmp = {k[len(f"{j}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j}.")}
            if len(_tmp) == 0:
                break
            # torch.save(_tmp, os.path.join(output_path, f'pytorch_lm_head.pt'))
            model.model.norm.weight.data[:] = _tmp['norm.weight']
            if 'norm.bias' in _tmp:
                model.model.norm.bias.data[:] = _tmp['norm.bias']
            model.lm_head.weight.data[:] = _tmp['lm_head.weight']
            if 'lm_head.bias' in _tmp:
                model.lm_head.bias.data[:] = _tmp['lm_head.bias']

        else:
            for j in range(n_layer_per_stage):
                _tmp = {k[len(f"{j}."):]:v for k,v in checkpoint.items() if k.startswith(f"{j}.")}
                if len(_tmp) == 0:
                    break
                # torch.save(_tmp, os.path.join(output_path, f'pytorch_{i*n_layer_per_stage + j}.pt'))
                ret = model.model.layers[i*n_layer_per_stage + j].load_state_dict(_tmp, strict=False)
                if len(ret.missing_keys):
                    print('The following weight keys are missing:')
                    print(ret.missing_keys)
                if len(ret.unexpected_keys):
                    print('The following weight keys are unexpected:')
                    print(ret.unexpected_keys)

    return model


if __name__ == '__main__':
    
    parser = argparse.ArgumentParser(description='Convert a pipeline-sharded checkpoint to Hugging Face format')
    parser.add_argument('--config-name', type=str, default='togethercomputer/Llama-2-7B-32K-beta',
                        help='HF model name or path used to load the config and tokenizer')
    parser.add_argument('--ckpt-path', type=str, default=None,
                        help='directory containing the prank_{i}_checkpoint.pt files')
    parser.add_argument('--save-path', type=str, default=None,
                        help='output directory for the converted HF model')
    parser.add_argument('--n-stages', type=int, default=8,
                        help='pipeline group size')
    parser.add_argument('--n-layer-per-stage', type=int, default=4,
                        help='number of layers per pipeline stage (GPU device)')
    parser.add_argument('--fp16', default=False, action='store_true')
    args = parser.parse_args()
    
    assert args.ckpt_path is not None
    assert args.save_path is not None
    
    # Create the output directory (and any missing parents) if needed.
    os.makedirs(args.save_path, exist_ok=True)

    # LlamaForCausalLM LlamaConfig LlamaTokenizer
    print('loading config...')
    config = AutoConfig.from_pretrained(args.config_name)
    print('loaded config.')
    print('loading tokenizer...')
    tokenizer = AutoTokenizer.from_pretrained(args.config_name)
    print('loaded tokenizer.')
    print('creating empty model...')
    model = create_emtpy_llama(config)
    if args.fp16:
        model = model.half()
    print('created empty model.')
    print('loading model ckpt...')
    load_decentralized_checkpoint(
        model, args.ckpt_path, n_stages=args.n_stages, n_layer_per_stage=args.n_layer_per_stage,
    )
    print('loaded model ckpt.')
    
    print('saving HF model...')
    model.save_pretrained(args.save_path)
    print(f'saved HF model to `{args.save_path}`')
    config.save_pretrained(args.save_path)
    tokenizer.save_pretrained(args.save_path)


================================================
FILE: tools/model_load_benchmark.py
================================================
import argparse
import json
import time
import torch
import os
import re
import psutil
from transformers import AutoTokenizer, AutoModelForCausalLM

# Benchmark download, tokenize, load, inference time.
def benchmark(model_dict: dict, device_name: str, repeat_infer: int):

    # Initialize the benchmark results dictionary
    results_dict = {}

    # Check that we have CUDA GPUs available before running the benchmark
    if not torch.cuda.is_available():
        print("ERROR: CUDA GPUs are not available, benchmark not run")
        return results_dict

    device = torch.device(device_name)

    process = psutil.Process()

    print(f'Using device {device}')

    # Loop through the models to test
    for model_name, model_path in model_dict.items():
        # purge unused cached memory
        torch.cuda.empty_cache()

        print(f"Testing model: {model_name}")

        # Measure the time it takes to download the tokenizer data and load the tokenizer
        tokenizer_download_start_time = time.time()
        tokenizer = AutoTokenizer.from_pretrained(model_path, force_download=True)
        tokenizer_download_end_time = time.time()

        tokenizer = None

        # Measure the time it takes to load the tokenizer from the local cache
        tokenizer_load_start_time = time.time()
        tokenizer = AutoTokenizer.from_pretrained(model_path)
        tokenizer_load_end_time = time.time()

        tokenizer_load_sec = tokenizer_load_end_time - tokenizer_load_start_time
        tokenizer_download_sec = tokenizer_download_end_time - tokenizer_download_start_time - tokenizer_load_sec

        print(f"Testing model: {model_name} --- tokenizer download time = {tokenizer_download_sec:.3} sec")
        print(f"Testing model: {model_name} --- tokenizer load time = {tokenizer_load_sec:.3} sec")

        # Measure the time it takes to download and load the model into main memory
        model_download_start_time = time.time()
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, torchscript=True, force_download=True)
        model_download_end_time = time.time()
        
        model = None

        # Measure the time it takes to load the model into main memory
        memory_used_main_start = process.memory_info().rss
        model_load_to_ram_start_time = time.time()
        model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16, torchscript=True)
        model_load_to_ram_end_time = time.time()
        memory_used_main_end = process.memory_info().rss

        model_load_to_ram_sec = model_load_to_ram_end_time - model_load_to_ram_start_time
        model_download_sec = model_download_end_time - model_download_start_time - model_load_to_ram_sec
        model_main_memory_bytes = memory_used_main_end - memory_used_main_start

        print(f"Testing model: {model_name} --- model download time = {model_download_sec:.3} sec")
        print(f"Testing model: {model_name} --- model load to RAM time = {model_load_to_ram_sec:.3} sec")
        print(f"Testing model: {model_name} --- model main memory size = {model_main_memory_bytes} bytes")

        # Measure the time it takes to load the model from main memory to the GPU
        gpu_memory_start = torch.cuda.memory_allocated(device)
        model_xfer_to_gpu_start_time = time.time()
        model = model.to(device)
        model_xfer_to_gpu_end_time = time.time()
        gpu_memory_end = torch.cuda.memory_allocated(device)

        model_xfer_to_gpu_sec = model_xfer_to_gpu_end_time - model_xfer_to_gpu_start_time
        model_gpu_memory_bytes = gpu_memory_end - gpu_memory_start

        print(f"Testing model: {model_name} --- model transfer to GPU time = {model_xfer_to_gpu_sec:.3} sec")
        print(f"Testing model: {model_name} --- model GPU memory size = {model_gpu_memory_bytes} bytes")

        # Measure the time it takes to run inference from a cold start
        inference_start_time = time.time()
        inputs = tokenizer("Hello, world!", return_tensors="pt").to(device)
        outputs = model(**inputs)
        inference_end_time = time.time()
        inference_sec = inference_end_time - inference_start_time

        print(f"Testing model: {model_name} --- inference time = {inference_sec:.3} sec")

        # Measure the average time it takes to run inference from a warm start
        inference_warm_start_time = time.time()
        for i in range(0, repeat_infer):
            inputs = tokenizer("Hello, world!", return_tensors="pt").to(device)
            outputs = model(**inputs)
        inference_warm_end_time = time.time()
        inference_warm_sec = (inference_warm_end_time - inference_warm_start_time) / float(repeat_infer)

        print(f"Testing model: {model_name} --- inference warm time = {inference_warm_sec:.3} sec")

        total_sec = tokenizer_download_sec + tokenizer_load_sec + model_download_sec + model_load_to_ram_sec + model_xfer_to_gpu_sec + inference_sec

        print(f"Testing model: {model_name} --- total time = {total_sec:.3} sec")

        # Add the results to the dictionary
        results_dict[model_name] = {
            "tokenizer_download_sec": tokenizer_download_sec,
            "tokenizer_load_sec": tokenizer_load_sec,
            "model_download_sec": model_download_sec,
            "model_load_to_ram_sec": model_load_to_ram_sec,
            "model_main_memory_MB": float(model_main_memory_bytes) / 1000000.0,
            "model_transfer_to_gpu_sec": model_xfer_to_gpu_sec,
            "model_gpu_memory_MB": float(model_gpu_memory_bytes) / 1000000.0,
            "inference_sec": inference_sec,
            "inference_warm_sec": inference_warm_sec,
            "total_sec": total_sec
        }

        # Unload the model
        model = None
        torch.cuda.empty_cache()

    return results_dict

# Define the main function
def main(input_file: str, output_file: str, device_name: str, repeat_infer: int):

    # Load the models to test from the input JSON file
    with open(input_file, "r") as f:
        model_dict = json.load(f)

    # Run the benchmark
    results_dict = benchmark(model_dict, device_name, repeat_infer)

    # Write the results to the JSON output file
    # Use a regular expression to shorten floating-point values in the JSON text.
    # Raw strings avoid invalid-escape warnings for `\s`, `\d`, and `\.`.
    json_data = re.sub(r'"(.*?)":\s*(0\.0*\d{2}|\d+\.\d{2})\d*(,?\n)', r'"\1": \2\3', json.dumps(results_dict, indent=4))
    with open(output_file, 'w') as f:
        f.write(json_data)

if __name__ == "__main__":
    # Create an argument parser
    parser = argparse.ArgumentParser(description='Benchmark downloading, loading, and running an inference for a set of ML models.')
    parser.add_argument('-i', '--input', required=True, help='Input JSON file containing models to be benchmarked')
    parser.add_argument('-o', '--output', required=True, help='Output JSON file with model benchmark results')
    parser.add_argument('-d', '--device', required=False, default='cuda:0', help='Cuda device name, e.g. "cuda:0"')
    # type=int is required: without it a value passed on the command line is a string,
    # and the max(args.repeat_infer, 1) call below raises a TypeError.
    parser.add_argument('-r', '--repeat-infer', required=False, type=int, default=30, help='Repeat inference for warm timings')

    # Parse the command line arguments
    args = parser.parse_args()

    # Process the data
    main(args.input, args.output, args.device, max(args.repeat_infer, 1))

================================================
FILE: training/README.md
================================================
# OpenChatKit Training

This directory contains code for training a chat model using OpenChatKit. The main training script is `finetune_GPT-NeoXT-Chat-Base-20B.sh`.

To customize training, make a copy of the script and modify the arguments.

## Arguments

Environment vars that should be set:
```bash
export GLOO_SOCKET_IFNAME=lo # this interface should be consistent with `--net-interface`
export NCCL_SOCKET_IFNAME=lo # this interface should be consistent with `--net-interface`
export WANDB_NAME=gptj-test # wandb run name
```

The following arguments should be carefully set:
- `--model-name`: The path of model ckpt sharded by layers.
- `--tokenizer-name`: Usually the same as `--model-name`. You can also use a Hugging Face model name.
- `--model-type`: The model type. {gptj}. More model types will be added soon.
- `--num-layers`: Number of Transformer layers **for each GPU**. E.g. GPT-J has 28 layers, so if two GPUs form a pipeline, `--num-layers` should be 14.
- `--embedding-dim`: The hidden size of the model. GPT-J-6B is 4096. This is used to create buffers.
- `--dist-url`: URL of the rank 0 worker (master). It is the same for all workers and must be reachable by all of them. For local training (single machine, multiple GPUs), this can be something like `--dist-url tcp://127.0.0.1:7033`.
- `--world-size`: The total number of workers. `world-size == pipeline-group-size * data-group-size`
- `--pipeline-group-size`: Number of GPU workers for each pipeline
- `--data-group-size`: Number of data parallel workers. Also the number of pipelines.
- `--net-interface`: Network interface. Should be consistent with `GLOO_SOCKET_IFNAME` and `NCCL_SOCKET_IFNAME`.
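
The relationship between these arguments can be checked before launching a run. The sketch below uses a hypothetical helper, `plan_layout`, which is not part of OpenChatKit. It derives `--num-layers` and `--world-size` from the total layer count and the two group sizes, and lists which global ranks share a pipeline group or a data-parallel group. The rank layout follows the convention in `training/comm/comm_utils.py` (pipeline rank is `rank % pipeline_group_size`, data-parallel rank is `rank // pipeline_group_size`). Rounding the per-GPU layer count up when the layers do not divide evenly is an assumption made here.

```python
import math


def plan_layout(total_layers: int, pipeline_group_size: int, data_group_size: int) -> dict:
    """Derive per-GPU layer count, world size, and rank groups (illustrative only)."""
    world_size = pipeline_group_size * data_group_size

    # Layers per pipeline stage; rounding up is an assumption for uneven splits.
    num_layers = math.ceil(total_layers / pipeline_group_size)

    # Consecutive ranks form one pipeline: [0..P-1], [P..2P-1], ...
    pipeline_groups = [
        list(range(d * pipeline_group_size, (d + 1) * pipeline_group_size))
        for d in range(data_group_size)
    ]

    # Ranks holding the same pipeline stage form one data-parallel group.
    data_groups = [
        list(range(p, world_size, pipeline_group_size))
        for p in range(pipeline_group_size)
    ]

    return {
        "world_size": world_size,
        "num_layers": num_layers,
        "pipeline_groups": pipeline_groups,
        "data_groups": data_groups,
    }


# GPT-J example from the list above: 28 layers over a two-GPU pipeline.
layout = plan_layout(total_layers=28, pipeline_group_size=2, data_group_size=1)
print(layout["num_layers"], layout["world_size"])

# Eight workers: two pipelines of four stages each.
layout8 = plan_layout(total_layers=28, pipeline_group_size=4, data_group_size=2)
print(layout8["pipeline_groups"])
print(layout8["data_groups"])
```

With eight workers the groups match the layout described in the comments of `comm_utils.py`: pipelines `[0,1,2,3]` and `[4,5,6,7]`, and data-parallel groups `[0,4]`, `[1,5]`, `[2,6]`, `[3,7]`.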

The following arguments can be tuned / changed:
- `--train-log-backend`: How to log the training info. {print, loguru, wandb}.
- `--optimizer`: Optimizer type. {adam, 8bit-adam} (8bit-adam requires `pip install bitsandbytes`)
- `--load-pretrained-model`: Whether to load model weights. Usually `true`.
- `--task-name`: The task name or the path of a `jsonl` file. For multi-task training, separate task names with `,`.
   Each task name may be followed by an optional sampling weight, separated by `:` (default is 1.0). Sampling weights are normalized.
   For example: `--task-name cot:0.1,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0,/path_task0.jsonl:1.0`.
   Here `cot:0.1` means the `cot` task is sampled with a weight of 0.1 (before normalization).
- `--checkpoint-path`: Path to save fine-tuned checkpoints.
- `--checkpoint-steps`: Save a checkpoint every `checkpoint-steps` steps.
- `--total-steps`: Total number of steps for training. (This counts all `gradient-accumulate-step`s.)
- `--warmup-steps`: LR warmup steps.
- `--lr`: learning rate
- `--seq-length`: sequence length
- `--batch-size`: batch size for each GPU device (of each gradient accumulation step).
- `--micro-batch-size`: micro batch size for pipeline parallelism. 1 works fine.
- `--gradient-accumulate-step`: Accumulate gradients for several steps before updating parameters. This is another way to achieve large batch sizes when GPU memory is not enough.
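
To make the `--task-name` format concrete, the sketch below parses such a string and normalizes the sampling weights. `parse_task_names` is a hypothetical helper written for illustration and is not the parser used by the training code; it assumes that task names and paths do not themselves end in `:<something>`.

```python
from typing import List, Tuple


def parse_task_names(spec: str) -> List[Tuple[str, float]]:
    """Split a `--task-name` string into (name, normalized_weight) pairs."""
    tasks = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # Split on the last colon so the weight is taken from the end of the item.
        name, sep, weight = item.rpartition(":")
        if sep:
            tasks.append((name, float(weight)))
        else:
            # No weight given: fall back to the documented default of 1.0.
            tasks.append((item, 1.0))

    total = sum(weight for _, weight in tasks)
    if total <= 0:
        raise ValueError("sampling weights must sum to a positive value")
    return [(name, weight / total) for name, weight in tasks]


parsed = parse_task_names("cot:0.1,/path_task0.jsonl:1.0")
for name, weight in parsed:
    print(f"{name}: {weight:.4f}")
```

After normalization the weights sum to 1, so `cot:0.1` next to a single task with weight 1.0 is sampled roughly 9% of the time.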

The following arguments usually do not change:
- `--dp-backend`: {nccl, gloo}, default nccl.
- `--dp-mode`: {allreduce}.
- `--fp16`: Flag to enable FP16 mixed precision training. Always add it with the current implementation.
- `--pp-mode`: always `gpipe`
- `--profiling`: {no-profiling, tidy_profiling}. `tidy_profiling` will generate profile jsons.

## Adding Your Own Data to the DATASETS

To add your own data to the training process, you should create a `jsonl` file where each line is a JSON object representing a single training example. Once you have your `jsonl` file, you can include it in the `--task-name` argument with an appropriate sampling weight. For instance, if your file is located at `/path_to_your_data/your_data.jsonl` and you wish to give it a sampling weight of 0.5, you would add `/path_to_your_data/your_data.jsonl:0.5` to the `--task-name` argument.
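
A minimal sketch of preparing such a file follows. The `text` field name and the `<human>:` / `<bot>:` turn markers are assumptions for illustration, so check an existing OIG file for the exact schema the training code expects. `write_jsonl` and `task_name_entry` are hypothetical helpers, not part of OpenChatKit.

```python
import json
import os
import tempfile


def write_jsonl(path, examples):
    """Write one JSON object per line; json.dumps escapes embedded newlines."""
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            f.write(json.dumps(example, ensure_ascii=False) + "\n")


def task_name_entry(path, weight):
    """Format a `path:weight` fragment for the `--task-name` argument."""
    return f"{path}:{weight}"


# Assumed schema: a single "text" field holding the whole conversation.
examples = [
    {"text": "<human>: What is a jsonl file?\n<bot>: A text file with one JSON object per line."},
    {"text": "<human>: How many examples go on each line?\n<bot>: Exactly one."},
]

out_path = os.path.join(tempfile.mkdtemp(), "your_data.jsonl")
write_jsonl(out_path, examples)

# Read the file back to confirm every line parses as a standalone JSON object.
with open(out_path, encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

entry = task_name_entry(out_path, 0.5)
print(entry)
```

The printed fragment can be appended, comma-separated, to the existing `--task-name` value.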

If you have any questions or need further assistance, please refer to the [OpenDataHub](https://github.com/togethercomputer/OpenDataHub) repository or contact us through our [website](https://www.together.ai/contact).


================================================
FILE: training/comm/__init__.py
================================================


================================================
FILE: training/comm/comm_utils.py
================================================
from .torch_backend import *
from .nccl_backend import *

_DATA_PARALLEL_COMM = None
_DATA_PARALLEL_RANK = None
_DATA_PARALLEL_WORLD_SIZE = None

_PIPELINE_PARALLEL_COMM = None
_PIPELINE_PARALLEL_RANK = None
_PIPELINE_PARALLEL_WORLD_SIZE = None

_TENSOR_PARALLEL_COMM = None
_TENSOR_PARALLEL_RANK = None
_TENSOR_PARALLEL_WORLD_SIZE = None

import threading 

_LOCK = threading.RLock()

def get_lock():
    return _LOCK

def get_data_parallel_comm() -> NCCLCommunicator:
    assert _DATA_PARALLEL_COMM is not None
    return _DATA_PARALLEL_COMM


def get_data_parallel_rank() -> int:
    assert _DATA_PARALLEL_RANK is not None
    return _DATA_PARALLEL_RANK


def get_data_parallel_world_size() -> int:
    assert _DATA_PARALLEL_WORLD_SIZE is not None
    return _DATA_PARALLEL_WORLD_SIZE


def get_pipeline_parallel_comm() -> NCCLCommunicator:
    assert _PIPELINE_PARALLEL_COMM is not None
    return _PIPELINE_PARALLEL_COMM


def get_pipeline_parallel_rank() -> int:
    assert _PIPELINE_PARALLEL_RANK is not None
    return _PIPELINE_PARALLEL_RANK


def get_pipeline_parallel_world_size() -> int:
    assert _PIPELINE_PARALLEL_WORLD_SIZE is not None
    return _PIPELINE_PARALLEL_WORLD_SIZE


def get_megatron_tensor_parallel_comm() -> NCCLCommunicator:
    assert _TENSOR_PARALLEL_COMM is not None
    return _TENSOR_PARALLEL_COMM


def get_megatron_tensor_parallel_rank() -> int:
    assert _TENSOR_PARALLEL_RANK is not None
    return _TENSOR_PARALLEL_RANK


def get_megatron_tensor_parallel_world_size() -> int:
    assert _TENSOR_PARALLEL_WORLD_SIZE is not None
    return _TENSOR_PARALLEL_WORLD_SIZE


def default_init(args):
    import datetime
    import time
    try:
        dist.destroy_process_group()
        # On the first call no process group exists yet, so the line above raises
        # and the port bump below is skipped.
        print('destroy comm, increase port by 1. (this could cause problems)')
        url = ':'.join(args.dist_url.split(':')[:-1])
        port = int(args.dist_url.split(':')[-1]) + 1
        args.dist_url = f"{url}:{port}"
        print(f"new master url: {args.dist_url}")
    except Exception:
        pass
    dist.init_process_group(backend='gloo', timeout=datetime.timedelta(seconds=5*60), init_method=args.dist_url, world_size=args.world_size, rank=args.rank)
    

def init_communicators(args):
    default_init(args)
    assert args.world_size == args.data_group_size * args.pipeline_group_size
    if args.world_size == args.data_group_size * args.pipeline_group_size:
        #    The communication groups are aligned in the following hard-coded way:
        #    suppose there are 8 instances (world_size) and 4 data parallel groups (data_group_size is 2).
        #    Then there are 2 pipeline parallel groups (pipeline_group_size is 4), and the groups look like:
        #    pipeline parallel: <group 0: [0,1,2,3]>, <group 1: [4,5,6,7]>
        #    data parallel: <group 0: [0,4]>, <group 1: [1,5]>, <group 2: [2,6]>, <group 3: [3,7]>
        # assert args.world_size == args.data_group_size * args.pipeline_group_size
        global _DATA_PARALLEL_COMM
        global _PIPELINE_PARALLEL_COMM
        global _DATA_PARALLEL_RANK
        global _PIPELINE_PARALLEL_RANK
        global _DATA_PARALLEL_WORLD_SIZE
        global _PIPELINE_PARALLEL_WORLD_SIZE
        # We use pipeline parallel by default.
        _PIPELINE_PARALLEL_WORLD_SIZE = args.pipeline_group_size
        _PIPELINE_PARALLEL_RANK = args.rank % args.pipeline_group_size
        _PIPELINE_PARALLEL_COMM = NCCLCommunicator(_PIPELINE_PARALLEL_RANK, args.cuda_id, args.pipeline_group_size,
                                                   "pipeline_group_"+str(args.rank // args.pipeline_group_size))
        if args.data_group_size != 1:
            _DATA_PARALLEL_WORLD_SIZE = args.data_group_size
            _DATA_PARALLEL_RANK = args.rank // args.pipeline_group_size
            
            dp_backend = getattr(args, 'dp_backend', 'gloo')
            if dp_backend == 'nccl':
            
                _DATA_PARALLEL_COMM = NCCLCommunicator(_DATA_PARALLEL_RANK, args.cuda_id, args.data_group_size,
                                                       "data_group_"+str(args.rank % args.pipeline_group_size))
            
            elif dp_backend == 'gloo':
                
                for i in range(args.pipeline_group_size):
                    ranks = [rank for rank in range(i, args.world_size, args.pipeline_group_size)]
                    print(args.rank, ranks)
                    data_group = torch.distributed.new_group(ranks, backend='gloo')
                    if args.rank in ranks:
                        def to_global_rank(dp_rank):
                            rank = _PIPELINE_PARALLEL_RANK + dp_rank * args.pipeline_group_size
                            # print(f"{dp_rank} --> {rank}")
                            return rank
                        _DATA_PARALLEL_COMM = TorchCommunicator(
                            data_group, to_global_rank=to_global_rank, 
                            dp_rank=_DATA_PARALLEL_RANK,
                            comm_group_size=args.data_group_size,)
            
            else:
                assert False
            
        print('comm init done!!')
            
    # elif args.world_size == args.data_group_size * args.tensor_group_size:
    #    global _DATA_PARALLEL_COMM
    #    global _TENSOR_PARALLEL_COMM
    #    global _DATA_PARALLEL_RANK
    #    global _TENSOR_PARALLEL_RANK
    #    global _DATA_PARALLEL_WORLD_SIZE
    #    global _TENSOR_PARALLEL_WORLD_SIZE
        # We use megatron tensor parallel by default.
    #    _TENSOR_PARALLEL_WORLD_SIZE = args.tensor_group_size
    #    _TENSOR_PARALLEL_RANK = args.rank % args.tensor_group_size
    #    _TENSOR_PARALLEL_COMM = NCCLCommunicator(_TENSOR_PARALLEL_RANK, args.cuda_id, args.tensor_group_size,
    #                                             "tensor_group_" + str(args.rank // args.tensor_group_size))
    #    if args.data_group_size != 1:
    #        _DATA_PARALLEL_WORLD_SIZE = args.data_group_size
    #        _DATA_PARALLEL_RANK = args.rank // args.tensor_group_size
    #        _DATA_PARALLEL_COMM = NCCLCommunicator(_DATA_PARALLEL_RANK, args.cuda_id, args.data_group_size,
    #                                              "data_group_" + str(args.rank % args.tensor_group_size))
    else:
        print("Not supported yet")
        raise NotImplementedError("Not supported yet")

        
        
def reinit_dp_communicator(args):
    
    print('###### reinit start #######')
    
    default_init(args)
    assert args.world_size == args.data_group_size * args.pipeline_group_size
    if args.world_size == args.data_group_size * args.pipeline_group_size:
        #    The communication groups use the following hard-coded alignment.
        #    Suppose there are 8 instances (world_size is 8) and each pipeline has 4 stages (pipeline_group_size is 4).
        #    Then there are 2 pipeline parallel groups and 4 data parallel groups of 2 ranks each (data_group_size is 2):
        #    pipeline parallel: <group 0: [0,1,2,3]>, <group 1: [4,5,6,7]>
        #    data parallel: <group 0: [0,4]>, <group 1: [1,5]>, <group 2: [2,6]>, <group 3: [3,7]>
        # assert args.world_size == args.data_group_size * args.pipeline_group_size
        global _DATA_PARALLEL_COMM
        global _PIPELINE_PARALLEL_COMM
        global _DATA_PARALLEL_RANK
        global _PIPELINE_PARALLEL_RANK
        global _DATA_PARALLEL_WORLD_SIZE
        global _PIPELINE_PARALLEL_WORLD_SIZE
        
        if args.data_group_size != 1:
            
            dp_backend = getattr(args, 'dp_backend', 'gloo')
            if dp_backend == 'nccl':
            
                raise Exception('NCCL cannot reinit.')
            
            elif dp_backend == 'gloo':
                
                for i in range(args.pipeline_group_size):
                    ranks = [rank for rank in range(i, args.world_size, args.pipeline_group_size)]
                    print(args.rank, ranks)
                    data_group = torch.distributed.new_group(ranks, backend='gloo')
                    if args.rank in ranks:
                        def to_global_rank(dp_rank):
                            rank = _PIPELINE_PARALLEL_RANK + dp_rank * args.pipeline_group_size
                            # print(f"{dp_rank} --> {rank}")
                            return rank
                        _DATA_PARALLEL_COMM = TorchCommunicator(
                            data_group, to_global_rank=to_global_rank, 
                            dp_rank=_DATA_PARALLEL_RANK,
                            comm_group_size=args.data_group_size,)
            
            else:
                raise ValueError(f"Unsupported dp_backend: {dp_backend!r} (expected 'nccl' or 'gloo').")
            
        print('######## dp comm reinit done!! ########')
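

# Illustrative sketch (not used by the training code): the rank layout described in the
# comments above, written as a standalone helper. The function name is hypothetical, and
# it assumes world_size is a multiple of pipeline_group_size.
def _sketch_parallel_groups(world_size, pipeline_group_size):
    """Return (pipeline_groups, data_groups) as lists of global ranks."""
    assert world_size % pipeline_group_size == 0
    # Consecutive ranks form one pipeline: rank // pipeline_group_size selects the pipeline.
    pipeline_groups = [list(range(start, start + pipeline_group_size))
                       for start in range(0, world_size, pipeline_group_size)]
    # Ranks holding the same stage in different pipelines form one data parallel group.
    data_groups = [list(range(stage, world_size, pipeline_group_size))
                   for stage in range(pipeline_group_size)]
    return pipeline_groups, data_groups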

================================================
FILE: training/comm/nccl_backend.py
================================================
import torch
import numpy as np
import cupy
import cupy.cuda.nccl
import torch.distributed as dist
from typing import List


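def _type_torch_to_cupy(torch_type: torch.dtype):
    """Map a torch dtype to the matching cupy NCCL datatype constant."""
    # print(torch_type)
    mappings = {
        torch.uint8: cupy.cuda.nccl.NCCL_UINT8,
        torch.int32: cupy.cuda.nccl.NCCL_INT32,
        torch.int64: cupy.cuda.nccl.NCCL_INT64,
        # torch.int is an alias of torch.int32, so this entry reuses the key above.
        torch.int: cupy.cuda.nccl.NCCL_INT,
        torch.float16: cupy.cuda.nccl.NCCL_FLOAT16,
        torch.float32: cupy.cuda.nccl.NCCL_FLOAT32,
        torch.float64: cupy.cuda.nccl.NCCL_FLOAT64,
        # torch.float is an alias of torch.float32, so this entry reuses the key above.
        torch.float: cupy.cuda.nccl.NCCL_FLOAT
    }
    # Any dtype that is not listed raises KeyError.
    return mappings[torch_type]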


class NCCLCommunicator:
    def __init__(self,
                 comm_rank: int,
                 cuda_id: int,
                 comm_group_size: int,
                 comm_name: str):
        self.comm_rank = comm_rank
        cupy.cuda.Device(cuda_id).use()
        self.comm_group_size = comm_group_size
        print("Initialize NCCLCommunicator: <", comm_name, ">; rank:", comm_rank)
        self.dist_store = dist.distributed_c10d._get_default_store()

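        if self.comm_rank == 0:
            # Rank 0 creates the NCCL unique id and publishes it through the store.
            # (Named nccl_id so it does not shadow the cuda_id device argument.)
            nccl_id = cupy.cuda.nccl.get_unique_id()
            # print(nccl_id)
            nccl_id_bytes = np.array(nccl_id).tobytes()
            self.dist_store.set('group-'+comm_name+'-unique-id', nccl_id_bytes)
            # print("Master put <group-"+comm_name+"-unique-id: ", nccl_id_bytes, ">.")
        else:
            # The other ranks wait in the store until rank 0 has published the id.
            nccl_id_bytes = self.dist_store.get('group-'+comm_name+'-unique-id')

        comm_id = tuple(np.frombuffer(nccl_id_bytes, dtype=int))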
        # comm_id = cupy.cuda.nccl.get_unique_id()
        # print(comm_id)
        self.comm = cupy.cuda.nccl.NcclCommunicator(comm_group_size, comm_id, comm_rank)

    @staticmethod
    def barrier():
        dist.barrier()

    def store_set(self, key, value):
        self.dist_store.set(key, value)

    def store_get(self, key):
        return self.dist_store.get(key)

    def send(self,
             tensor: torch.Tensor,
             dst: int,
             stream=cupy.cuda.Stream.null):
        # print("Send tensor of size:", torch.numel(tensor))
        self.comm.send(
            tensor.data_ptr(),
            torch.numel(tensor),
            _type_torch_to_cupy(tensor.dtype),
            dst,
            stream.ptr
        )

    def recv(self,
             tensor: torch.Tensor,
             src: int,
             stream=cupy.cuda.Stream.null):
        # print("Recv tensor of size:", torch.numel(tensor))
        # print("mean:", torch.mean(tensor).item(), " std:", torch.std(tensor).item())
        self.comm.recv(
            tensor.data_ptr(),
            torch.numel(tensor),
            _type_torch_to_cupy(tensor.dtype),
            src,
            stream.ptr
        )

    def broadcast(self,
                  tensor: torch.Tensor,
                  src: int,
                  stream=cupy.cuda.Stream.null):
        self.comm.bcast(
            tensor.data_ptr(),
            torch.numel(tensor),
            _type_torch_to_cupy(tensor.dtype),
            src,
            stream.ptr
        )

    def reduce(self,
               tensor: torch.Tensor,
               dst: int,
               stream=cupy.cuda.Stream.null,
               op=cupy.cuda.nccl.NCCL_SUM):
        self.comm.reduce(
            tensor.data_ptr(),  # force it to be in-place.
            tensor.data_ptr(),
            torch.numel(tensor),
            _type_torch_to_cupy(tensor.dtype),
            op,
            dst,
            stream.ptr
        )

    def all_reduce(self,
                  tensor: torch.Tensor,
                  stream=cupy.cuda.Stream.null,
                  op=cupy.cuda.nccl.NCCL_SUM):
        self.comm.allReduce(
            tensor.data_ptr(),
            tensor.data_ptr(),
            torch.numel(tensor),
            _type_torch_to_cupy(tensor.dtype),
            op,
            stream.ptr
        )

    def scatter(self,
                tensor: torch.Tensor,
                scatter_list: List[torch.Tensor],
                src: int,
                stream=cupy.cuda.Stream.null):
        cupy.cuda.nccl.groupStart()
        if self.comm_rank == src:
            for i in range(self.comm_group_size):
                self.send(
                    scatter_list[i],
                    i,
                    stream
                )
        self.recv(
            tensor,
            src,
            stream
        )
        cupy.cuda.nccl.groupEnd()

    def gather(self,
               tensor: torch.Tensor,
               gather_list: List[torch.Tensor],
               dst: int,
               stream=cupy.cuda.Stream.null):
        cupy.cuda.nccl.groupStart()
        if self.comm_rank == dst:
            for i in range(self.comm_group_size):
                self.recv(
                    gather_list[i],
                    i,
                    stream
                )
        self.send(
            tensor,
            dst,
            stream
        )
        cupy.cuda.nccl.groupEnd()

    def all_to_all(self,
                   output_tensor_list: List[torch.Tensor],
                   input_tensor_list: List[torch.Tensor],
                   stream=cupy.cuda.Stream.null):
        assert len(output_tensor_list) == self.comm_group_size and len(input_tensor_list) == self.comm_group_size
        cupy.cuda.nccl.groupStart()
        for i in range(self.comm_group_size):
            self.send(input_tensor_list[i], i, stream)
            self.recv(output_tensor_list[i], i, stream)
        cupy.cuda.nccl.groupEnd()

    def all_gather(self,
                   tensor: torch.Tensor,
                   output_tensor_list: List[torch.Tensor],
                   stream=cupy.cuda.Stream.null
                   ):
        assert len(output_tensor_list) == self.comm_group_size
        cupy.cuda.nccl.groupStart()
        for i in range(self.comm_group_size):
            self.send(tensor, i, stream)
            self.recv(output_tensor_list[i], i, stream)
        cupy.cuda.nccl.groupEnd()

    def all_reduce_opt(self,
                       tensor: torch.Tensor,
                       buffer: List[torch.Tensor],
                       stream=cupy.cuda.Stream.null,
                       caller=None):
        # First do all-to-all
        assert torch.numel(tensor.data) % self.comm_group_size == 0
        chunk_size = torch.numel(tensor.data) // self.comm_group_size
        t_type = _type_torch_to_cupy(tensor.dtype)
        element_size = tensor.data.element_size()
        
        cupy.cuda.nccl.groupStart()
        for i in range(self.comm_group_size):
            self.comm.send(tensor.data_ptr()+i*chunk_size*element_size, chunk_size, t_type, i, stream.ptr)
            self.comm.recv(buffer[i].data_ptr(), chunk_size, t_type, i, stream.ptr)
        cupy.cuda.nccl.groupEnd()
        
        for i in range(1, self.comm_group_size):
            buffer[0] += buffer[i]

        cupy.cuda.nccl.groupStart()
        for i in range(self.comm_group_size):
            self.comm.send(buffer[0].data_ptr(), chunk_size, t_type, i, stream.ptr)
            self.comm.recv(tensor.data_ptr()+i*chunk_size*element_size, chunk_size, t_type, i, stream.ptr)
        cupy.cuda.nccl.groupEnd()
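

# Illustrative sketch (not used by the training code): the data movement of all_reduce_opt
# above, simulated on plain Python lists so it runs without GPUs or NCCL. The function name
# is hypothetical. rank_tensors[r] stands for the tensor held by rank r, and every tensor
# must have a length divisible by the number of ranks.
def _sketch_all_reduce_opt(rank_tensors):
    group_size = len(rank_tensors)
    length = len(rank_tensors[0])
    assert length % group_size == 0
    chunk_size = length // group_size
    # Phase 1 (all-to-all, then local sum): rank r receives chunk r from every rank
    # and adds the pieces together, so it owns the fully reduced chunk r.
    reduced_chunks = []
    for r in range(group_size):
        start = r * chunk_size
        reduced_chunks.append([sum(t[start + k] for t in rank_tensors)
                               for k in range(chunk_size)])
    # Phase 2 (all-gather): every rank receives reduced chunk i from rank i
    # and writes it to position i of its own tensor.
    full = [value for chunk in reduced_chunks for value in chunk]
    return [list(full) for _ in range(group_size)]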



================================================
FILE: training/comm/torch_backend.py
================================================
import torch
import torch.distributed as dist
from typing import List

class TorchCommunicator:
        
    def __init__(self,
                 process_group,
                 to_global_rank=lambda rank: rank,
                 dp_rank=None,
                 comm_group_size=None,):
        self.process_group = process_group
        self.to_global_rank = to_global_rank
        self.dp_rank = dp_rank
        self.comm_group_size = comm_group_size

    # @staticmethod
    def barrier(self):
        dist.barrier(group=self.process_group)

    def send(self,
             tensor: torch.Tensor,
             dst: int,
             stream=None):
        # print("Send tensor of size:", torch.numel(tensor))
        if tensor.device == torch.device('cpu'):
            dist.send(tensor, self.to_global_rank(dst), group=self.process_group)
        else:
            dist.send(tensor.cpu(), self.to_global_rank(dst), group=self.process_group)
            
    def recv(self,
             tensor: torch.Tensor,
             src: int,
             stream=None):
        
        if tensor.device == torch.device('cpu'):
            dist.recv(tensor, self.to_global_rank(src), group=self.process_group)
        else:
            buffer = tensor.cpu()
            dist.recv(buffer, self.to_global_rank(src), group=self.process_group)
            tensor[:] = buffer.to(tensor.device)
    
    def isend(self,
             tensor: torch.Tensor,
             dst: int,
             stream=None):
        # print("Send tensor of size:", torch.numel(tensor))
        if tensor.device == torch.device('cpu'):
            handler = dist.isend(tensor, self.to_global_rank(dst), group=self.process_group)
        else:
            handler = dist.isend(tensor.cpu(), self.to_global_rank(dst), group=self.process_group)
        return handler

    def irecv(self,
             tensor: torch.Tensor,
             src: int,
             stream=None):
        if tensor.device == torch.device('cpu'):
            handler = dist.irecv(tensor, self.to_global_rank(src), group=self.process_group)
        else:
            # A non-blocking receive into a GPU tensor would need a CPU staging buffer that is
            # copied back only after the request completes; that is not implemented here.
            raise NotImplementedError("irecv only supports CPU tensors.")
        return handler

    def broadcast(self,
                  tensor: torch.Tensor,
                  src: int,
                  stream=None):
        if tensor.device == torch.device('cpu'):
            dist.broadcast(tensor, self.to_global_rank(src), group=self.process_group)
        else:
            buffer = tensor.cpu()
            dist.broadcast(buffer, self.to_global_rank(src), group=self.process_group)
            tensor[:] = buffer.to(tensor.device)

    def reduce(self,
               tensor: torch.Tensor,
               dst: int,
               stream=None,
               op=dist.ReduceOp.SUM):
        dist.reduce(tensor, self.to_global_rank(dst), group=self.process_group, op=op)

    def all_reduce(self,
                   tensor: torch.Tensor,
                   stream = None,
                   op=dist.ReduceOp.SUM):
        buffer = tensor.cpu()
        dist.all_reduce(buffer, group=self.process_group, op=op)
        tensor[:] = buffer.to(tensor.device)

    def gather(self,
               tensor: torch.Tensor,
               gather_list: List[torch.Tensor],
               dst: int,
               stream=None):
        dist.gather(tensor, gather_list, self.to_global_rank(dst), group=self.process_group)

    def all_to_all(self,
                   output_tensor_list: List[torch.Tensor],
                   input_tensor_list: List[torch.Tensor],
                   stream=None):
        dist.all_to_all(output_tensor_list, input_tensor_list, group=self.process_group)

    def all_gather(self,
                   tensor: torch.Tensor,
                   output_tensor_list: List[torch.Tensor],
                   stream=None):
        dist.all_gather(output_tensor_list, tensor, group=self.process_group)
        


================================================
FILE: training/data_parallel/__init__.py
================================================


================================================
FILE: training/data_parallel/dist_dp_allreduce.py
================================================
import torch.cuda
import cupy  # used explicitly below for cupy.cuda.ExternalStream
from comm.comm_utils import *
from .flatten_utils import flatten_params


class AllReduceDP:
    def __init__(self, args, device, module: torch.nn.Module, optimizer: torch.optim.Optimizer = None, flatten=True):
        self.flatten = flatten
        self.global_rank = args.rank
        self.dp_group_size = args.data_group_size
        self.enable_tidy_profiling = (args.profiling == 'tidy_profiling')
        self.dp_comm = get_data_parallel_comm()
        self.dp_rank = get_data_parallel_rank()
        self.dp_comm_stream = torch.cuda.Stream(device=device, priority=-1)
        self.torch_optim_comp_stream = torch.cuda.default_stream(device=device)
        self.backward_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.allreduce_grad_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.optimizer_step_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)

        self.module = module
        num_paras, element_size = self._compute_total_para_num()
        print("Total number of parameters: {}, element size: {}, total size {} MB."
              .format(num_paras, element_size, num_paras * element_size // 1024 // 1024))

        if self.flatten:
            self.flatten_para = flatten_params(self.module.parameters())
            print("Flattened parameter number: {}, element size: {}."
                  .format(self.flatten_para.data.numel(), self.flatten_para.data.element_size()))
            print("Flattened parameter grad number: {}, element size: {}."
                  .format(self.flatten_para.grad.numel(), self.flatten_para.grad.element_size()))

        assert optimizer is not None
        self.optimizer = optimizer

        if self.enable_tidy_profiling:
            self.global_rank = args.rank
            self.init_event = None
            self.init_time_stamp = None
            if self.flatten:
                self.allreduce_gradients_start_event = torch.cuda.Event(enable_timing=True, blocking=False)
            else:
                self.allreduce_gradients_start_events = dict()
                self.allreduce_gradients_end_events = dict()
                for name, _ in self.module.named_parameters():
                    self.allreduce_gradients_start_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)
                    self.allreduce_gradients_end_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)

            self.optimizer_step_start_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling,
                                                               blocking=False)

    def _compute_total_para_num(self):
        total_count = 0
        element_size = 0
        for para in self.module.parameters():
            # print("Parameter: ", para.data.shape)
            total_count += torch.numel(para.data)
            element_size = para.element_size()
        return total_count, element_size

    def profile_mark_allreduce_start(self, name=None):
        if self.enable_tidy_profiling:
            if name is None:
                self.dp_comm_stream.record_event(self.allreduce_gradients_start_event)
            else:
                self.dp_comm_stream.record_event(self.allreduce_gradients_start_events[name])

    def profile_mark_allreduce_end(self, name=None):
        if self.enable_tidy_profiling:
            if name:
                self.dp_comm_stream.record_event(self.allreduce_gradients_end_events[name])

    def profile_mark_optimizer_step_start(self):
        if self.enable_tidy_profiling:
            self.torch_optim_comp_stream.record_event(self.optimizer_step_start_event)

    def _allreduce_gradients(self):
        with torch.cuda.stream(self.dp_comm_stream):
            cupy_dp_stream = cupy.cuda.ExternalStream(self.dp_comm_stream.cuda_stream)
            self.dp_comm_stream.wait_event(self.backward_ready_event)
            if self.flatten:
                self.profile_mark_allreduce_start()
                self.dp_comm.all_reduce(self.flatten_para.grad, stream=cupy_dp_stream)
                self.profile_mark_allreduce_end()
            else:
                for name, para in self.module.named_parameters():
                    if para.grad is None:
                        continue
                    self.profile_mark_allreduce_start(name)
                    self.dp_comm.all_reduce(para.grad, stream=cupy_dp_stream)
                    self.profile_mark_allreduce_end(name)
            self.dp_comm_stream.record_event(self.allreduce_grad_ready_event)

    def optimizer_step(self):
        self._allreduce_gradients()
        with torch.cuda.stream(self.torch_optim_comp_stream):
            self.torch_optim_comp_stream.wait_event(self.allreduce_grad_ready_event)
            self.profile_mark_optimizer_step_start()
            self.optimizer.step()
            self.torch_optim_comp_stream.record_event(self.optimizer_step_ready_event)

    def set_time_stamp(self, init_time_stamp, init_event):
        self.init_event = init_event
        self.init_time_stamp = init_time_stamp

    def get_ts(self, event):
        # elapsed_time() returns milliseconds; scale by 1e+3 to get microseconds.
        return self.init_time_stamp + self.init_event.elapsed_time(event) * 1e+3

    def profiling_data_parallel(self, init_time_stamp, init_event):
        self.set_time_stamp(init_time_stamp, init_event)
        profiling_log = []

        if self.flatten:
            allreduce_slot = self.allreduce_gradients_start_event.elapsed_time(self.allreduce_grad_ready_event)*1e+3
            allreduce_log = {"name": "opt_allreduce", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                             "ts": self.get_ts(self.allreduce_gradients_start_event),
                             "dur": allreduce_slot, "cname": "cq_build_passed",
                             "args": {'para': 'flattened_grad', 'size': self.flatten_para.grad.numel()}}
            # print(allreduce_log)
            profiling_log.append(allreduce_log)
        else:
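            for name, para in self.module.named_parameters():
                if para.grad is None:
                    # Skipped in _allreduce_gradients, so no events were recorded for it.
                    continue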
                allreduce_slot = self.allreduce_gradients_start_events[name].elapsed_time(
                    self.allreduce_gradients_end_events[name]) * 1e+3
                allreduce_log = {"name": "opt_allreduce", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                                 "ts": self.get_ts(self.allreduce_gradients_start_events[name]), "dur": allreduce_slot,
                                 "cname": "cq_build_passed", "args": {'para': name, 'size': torch.numel(para.data)}}
                # print(allreduce_log)
                profiling_log.append(allreduce_log)

        optimizer_slot = self.optimizer_step_start_event.elapsed_time(self.optimizer_step_ready_event) * 1e+3
        optimizer_log = {"name": "opt_comp", "ph": "X", "pid": self.global_rank, "tid": "8. optimizer-comp",
                         "ts": self.get_ts(self.optimizer_step_start_event), "dur": optimizer_slot, "cname": "bad"}
        # print(optimizer_log)
        profiling_log.append(optimizer_log)
        return profiling_log
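

# Illustrative sketch (not used by the training code): how one profiling entry above is put
# together. CUDA events report elapsed time in milliseconds and the code scales it by 1e+3,
# so "ts" and "dur" are in microseconds, provided init_time_stamp is already in microseconds
# (an assumption; that value is supplied by the caller). The helper name and its arguments
# are hypothetical.
def _sketch_profiling_entry(name, pid, tid, init_time_stamp, start_offset_ms, duration_ms):
    return {"name": name, "ph": "X", "pid": pid, "tid": tid,
            "ts": init_time_stamp + start_offset_ms * 1e+3,
            "dur": duration_ms * 1e+3}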


================================================
FILE: training/data_parallel/dist_dp_central_ps.py
================================================
import torch.cuda
import cupy  # used explicitly below for cupy.cuda.ExternalStream
from comm.comm_utils import *
from .flatten_utils import flatten_params


class CentralPSDP:
    def __init__(self, args, device, module: torch.nn.Module, optimizer: torch.optim.Optimizer = None, flatten=True):
        self.flatten = flatten
        self.global_rank = args.rank
        self.dp_group_size = args.data_group_size
        self.enable_tidy_profiling = (args.profiling == 'tidy_profiling')
        self.dp_comm = get_data_parallel_comm()
        self.dp_rank = get_data_parallel_rank()
        self.dp_comm_stream = torch.cuda.Stream(device=device, priority=-1)
        self.torch_optim_comp_stream = torch.cuda.default_stream(device=device)
        self.backward_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.broadcast_reduced_gradients_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling,
                                                                        blocking=False)
        self.optimizer_step_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)

        self.module = module
        num_paras, element_size = self._compute_total_para_num()
        print("Total number of parameters: {}, element size: {}, total size {} MB."
              .format(num_paras, element_size, num_paras * element_size // 1024 // 1024))

        if self.flatten:
            self.flatten_para = flatten_params(self.module.parameters())
            print("Flattened parameter number: {}, element size: {}."
                  .format(self.flatten_para.data.numel(), self.flatten_para.data.element_size()))
            print("Flattened parameter grad number: {}, element size: {}."
                  .format(self.flatten_para.grad.numel(), self.flatten_para.grad.element_size()))

        assert optimizer is not None
        self.optimizer = optimizer

        if self.enable_tidy_profiling:
            self.global_rank = args.rank
            self.init_event = None
            self.init_time_stamp = None
            if self.flatten:
                self.reduce_gradients_start_event = torch.cuda.Event(enable_timing=True, blocking=False)
                self.reduce_gradients_end_event = torch.cuda.Event(enable_timing=True, blocking=False)
                self.broadcast_reduced_grad_start_event = torch.cuda.Event(enable_timing=True, blocking=False)
            else:
                self.reduce_gradients_start_events = dict()
                self.reduce_gradients_end_events = dict()
                self.broadcast_reduced_grad_start_events = dict()
                self.broadcast_reduced_grad_end_events = dict()

                for name, _ in self.module.named_parameters():
                    self.reduce_gradients_start_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)
                    self.reduce_gradients_end_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)
                    self.broadcast_reduced_grad_start_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)
                    self.broadcast_reduced_grad_end_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)

            self.optimizer_step_start_event = torch.cuda.Event(enable_timing=True, blocking=False)

    def _compute_total_para_num(self):
        total_count = 0
        element_size = 0
        for para in self.module.parameters():
            # print("Parameter: ", para.data.shape)
            total_count += torch.numel(para.data)
            element_size = para.element_size()
        return total_count, element_size
    
    def profile_mark_reduce_start(self, name=None):
        if self.enable_tidy_profiling:
            if name is None:
                self.dp_comm_stream.record_event(self.reduce_gradients_start_event)
            else:
                self.dp_comm_stream.record_event(self.reduce_gradients_start_events[name])

    def profile_mark_reduce_end(self, name=None):
        if self.enable_tidy_profiling:
            if name is None:
                self.dp_comm_stream.record_event(self.reduce_gradients_end_event)
            else:
                self.dp_comm_stream.record_event(self.reduce_gradients_end_events[name])

    def profile_mark_optimizer_step_start(self):
        if self.enable_tidy_profiling:
            self.torch_optim_comp_stream.record_event(self.optimizer_step_start_event)
            
    def profile_mark_broadcast_start(self, name=None):
        if self.enable_tidy_profiling:
            if name is None:
                self.dp_comm_stream.record_event(self.broadcast_reduced_grad_start_event)
            else:
                self.dp_comm_stream.record_event(self.broadcast_reduced_grad_start_events[name])
            
    def profile_mark_broadcast_end(self, name=None):
        if self.enable_tidy_profiling:
            if name:
                self.dp_comm_stream.record_event(self.broadcast_reduced_grad_end_events[name])

    def _reduce_gradients(self):
        with torch.cuda.stream(self.dp_comm_stream):
            cupy_dp_stream = cupy.cuda.ExternalStream(self.dp_comm_stream.cuda_stream)
            self.dp_comm_stream.wait_event(self.backward_ready_event)
            if self.flatten:
                self.profile_mark_reduce_start()
                self.dp_comm.reduce(self.flatten_para.grad, dst=0, stream=cupy_dp_stream)
                self.profile_mark_reduce_end()
            else:
                for name, para in self.module.named_parameters():
                    if para.grad is None:
                        # Parameters without gradients have nothing to reduce.
                        continue
                    self.profile_mark_reduce_start(name)
                    self.dp_comm.reduce(para.grad, dst=0, stream=cupy_dp_stream)
                    self.profile_mark_reduce_end(name)

    def _broadcast_reduced_gradients(self):
        with torch.cuda.stream(self.dp_comm_stream):
            cupy_dp_stream = cupy.cuda.ExternalStream(self.dp_comm_stream.cuda_stream)
            if self.flatten:
                self.profile_mark_broadcast_start()
                self.dp_comm.broadcast(self.flatten_para.grad, src=0, stream=cupy_dp_stream)
                self.profile_mark_broadcast_end()
            else:
                for name, para in self.module.named_parameters():
                    if para.grad is None:
                        # Parameters without gradients have nothing to broadcast.
                        continue
                    self.profile_mark_broadcast_start(name)
                    self.dp_comm.broadcast(para.grad, src=0, stream=cupy_dp_stream)
                    self.profile_mark_broadcast_end(name)
            self.dp_comm_stream.record_event(self.broadcast_reduced_gradients_ready_event)

    def optimizer_step(self):
        self._reduce_gradients()
        self._broadcast_reduced_gradients()
        with torch.cuda.stream(self.torch_optim_comp_stream):
            self.torch_optim_comp_stream.wait_event(self.broadcast_reduced_gradients_ready_event)
            self.profile_mark_optimizer_step_start()
            self.optimizer.step()
            self.torch_optim_comp_stream.record_event(self.optimizer_step_ready_event)

    def set_time_stamp(self, init_time_stamp, init_event):
        self.init_event = init_event
        self.init_time_stamp = init_time_stamp

    def get_ts(self, event):
        # elapsed_time() returns milliseconds; scale by 1e+3 to get microseconds.
        return self.init_time_stamp + self.init_event.elapsed_time(event) * 1e+3

    def profiling_data_parallel(self, init_time_stamp, init_event):
        self.set_time_stamp(init_time_stamp, init_event)
        profiling_log = []
        if self.flatten:
            reduce_slot = self.reduce_gradients_start_event.elapsed_time(self.reduce_gradients_end_event) * 1e+3
            reduce_log = {"name": "opt_reduce", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                          "ts": self.get_ts(self.reduce_gradients_start_event),
                          "dur": reduce_slot, "cname": "cq_build_passed",
                          "args": {'para': 'flattened_grad', 'size': self.flatten_para.grad.numel()}}
            # print(reduce_log)
            profiling_log.append(reduce_log)
        else:
            for name, para in self.module.named_parameters():
                if para.grad is None:
                    # No reduce events exist for parameters without gradients.
                    continue
                reduce_slot = self.reduce_gradients_start_events[name].elapsed_time(
                    self.reduce_gradients_end_events[name]) * 1e+3
                reduce_log = {"name": "opt_reduce", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                              "ts": self.get_ts(self.reduce_gradients_start_events[name]), "dur": reduce_slot,
                              "cname": "cq_build_passed", "args": {'para': name, 'size': torch.numel(para.data)}}
                # print(reduce_log)
                profiling_log.append(reduce_log)

        optimizer_slot = self.optimizer_step_start_event.elapsed_time(self.optimizer_step_ready_event) * 1e+3
        optimizer_log = {"name": "opt_comp", "ph": "X", "pid": self.global_rank, "tid": "8. optimizer-comp",
                         "ts": self.get_ts(self.optimizer_step_start_event), "dur": optimizer_slot, "cname": "bad"}
        # print(optimizer_log)
        profiling_log.append(optimizer_log)

        if self.flatten:
            broadcast_slot = self.broadcast_reduced_grad_start_event.elapsed_time(
                self.broadcast_reduced_gradients_ready_event) * 1e+3
            broadcast_log = {"name": "opt_broadcast", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                             "ts": self.get_ts(self.broadcast_reduced_grad_start_event),
                             "dur": broadcast_slot, "cname": "cq_build_passed",
                             "args": {'para': 'flattened_grad', 'size': self.flatten_para.grad.numel()}}
            profiling_log.append(broadcast_log)
        else:
            for name, para in self.module.named_parameters():
                broadcast_slot = self.broadcast_reduced_grad_start_events[name].elapsed_time(
                    self.broadcast_reduced_grad_end_events[name]) * 1e+3
                broadcast_log = {"name": "opt_broadcast", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                                 "ts": self.get_ts(self.broadcast_reduced_grad_start_events[name]), "dur": broadcast_slot,
                                 "cname": "cq_build_passed", "args": {'para': name, 'size': torch.numel(para.data)}}
                # print(broadcast_log)
                profiling_log.append(broadcast_log)
        return profiling_log


================================================
FILE: training/data_parallel/dist_dp_local.py
================================================
import torch.cuda
import cupy
from comm.comm_utils import *
from .flatten_utils import flatten_params


class LocalDP:
    def __init__(self, args, device, module: torch.nn.Module, optimizer: torch.optim.Optimizer = None, flatten=True):
        # NOTE: flattening is forced on here, so the `flatten` argument is ignored.
        flatten = True
        self.flatten = flatten
        self.global_rank = args.rank
        self.dp_group_size = args.data_group_size
        self.enable_tidy_profiling = (args.profiling == 'tidy_profiling')
        self.dp_comm = get_data_parallel_comm()
        self.dp_rank = get_data_parallel_rank()
        self.dp_comm_stream = torch.cuda.Stream(device=device, priority=-1)
        self.torch_optim_comp_stream = torch.cuda.default_stream(device=device)
        self.backward_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.allreduce_gradients_start_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.allreduce_grad_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.optimizer_step_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)

        self.module = module
        num_paras, element_size = self._compute_total_para_num()
        print("Total number of parameters: {}, element size: {}, total size {} MB."
              .format(num_paras, element_size, num_paras * element_size // 1024 // 1024))

        if self.flatten:
            self.flatten_para = flatten_params(self.module.parameters())
            print("Flattened parameter number: {}, element size: {}."
                  .format(self.flatten_para.data.numel(), self.flatten_para.data.element_size()))
            print("Flattened parameter grad number: {}, element size: {}."
                  .format(self.flatten_para.grad.numel(), self.flatten_para.grad.element_size()))

        assert optimizer is not None
        self.optimizer = optimizer

        if self.enable_tidy_profiling:
            self.global_rank = args.rank
            self.init_event = None
            self.init_time_stamp = None
            if self.flatten:
                self.allreduce_gradients_start_event = torch.cuda.Event(enable_timing=True, blocking=False)
            else:
                self.allreduce_gradients_start_events = dict()
                self.allreduce_gradients_end_events = dict()
                for name, _ in self.module.named_parameters():
                    self.allreduce_gradients_start_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)
                    self.allreduce_gradients_end_events[name] = torch.cuda.Event(enable_timing=True, blocking=False)

            self.optimizer_step_start_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling,
                                                               blocking=False)

    def _compute_total_para_num(self):
        total_count = 0
        element_size = 0
        for para in self.module.parameters():
            # print("Parameter: ", para.data.shape)
            total_count += torch.numel(para.data)
            element_size = para.element_size()
        return total_count, element_size

    def profile_mark_allreduce_start(self, name=None):
        if self.enable_tidy_profiling:
            if name is None:
                self.dp_comm_stream.record_event(self.allreduce_gradients_start_event)
            else:
                self.dp_comm_stream.record_event(self.allreduce_gradients_start_events[name])

    def profile_mark_allreduce_end(self, name=None):
        if self.enable_tidy_profiling:
            if name:
                self.dp_comm_stream.record_event(self.allreduce_gradients_end_events[name])

    def profile_mark_optimizer_step_start(self):
        if self.enable_tidy_profiling:
            self.torch_optim_comp_stream.record_event(self.optimizer_step_start_event)
            
    def allreduce_parameters(self):
        self._local_parameters_backup = [
            p.data.clone() for p in self.module.parameters()
        ]
        torch.cuda.synchronize()
        self.dp_comm.barrier()
        with torch.cuda.stream(self.dp_comm_stream):
            cupy_dp_stream = cupy.cuda.ExternalStream(self.dp_comm_stream.cuda_stream)
            self.dp_comm_stream.wait_event(self.backward_ready_event)
            if self.flatten:
                self.profile_mark_allreduce_start()
                self.dp_comm.all_reduce(self.flatten_para.data, stream=cupy_dp_stream)
                self.flatten_para.data /= self.dp_group_size
                self.profile_mark_allreduce_end()
            else:
                for name, para in self.module.named_parameters():
                    self.profile_mark_allreduce_start(name)
                    self.dp_comm.all_reduce(para.data, stream=cupy_dp_stream)
                    para.data /= self.dp_group_size
                    self.profile_mark_allreduce_end(name)
            self.dp_comm_stream.record_event(self.allreduce_grad_ready_event)
        torch.cuda.synchronize()
        self.dp_comm.barrier()

    def rollback_parameters(self):
        if not hasattr(self, '_local_parameters_backup'):
            return
        
        for p, p_local in zip(self.module.parameters(), self._local_parameters_backup):
            p.data[:] = p_local.data
            
        del self._local_parameters_backup
            

    def optimizer_step(self):
        # torch.cuda.synchronize()
        with torch.cuda.stream(self.torch_optim_comp_stream):
            self.torch_optim_comp_stream.record_event(self.allreduce_gradients_start_event)
            self.torch_optim_comp_stream.record_event(self.allreduce_grad_ready_event)
            self.torch_optim_comp_stream.wait_event(self.backward_ready_event)
            self.profile_mark_optimizer_step_start()
            self.optimizer.step()
            self.torch_optim_comp_stream.record_event(self.optimizer_step_ready_event)

    def set_time_stamp(self, init_time_stamp, init_event):
        self.init_event = init_event
        self.init_time_stamp = init_time_stamp

    def get_ts(self, event):
        return self.init_time_stamp + self.init_event.elapsed_time(event) * 1e+3

    def profiling_data_parallel(self, init_time_stamp, init_event):
        self.set_time_stamp(init_time_stamp, init_event)
        profiling_log = []

        if self.flatten:
            allreduce_slot = self.allreduce_gradients_start_event.elapsed_time(self.allreduce_grad_ready_event)*1e+3
            allreduce_log = {"name": "opt_allreduce", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                             "ts": self.get_ts(self.allreduce_gradients_start_event),
                             "dur": allreduce_slot, "cname": "cq_build_passed",
                             "args": {'para': 'flattened_grad', 'size': self.flatten_para.grad.numel()}}
            # print(allreduce_log)
            profiling_log.append(allreduce_log)
        else:
            for name, para in self.module.named_parameters():
                allreduce_slot = self.allreduce_gradients_start_events[name].elapsed_time(
                    self.allreduce_gradients_end_events[name]) * 1e+3
                allreduce_log = {"name": "opt_allreduce", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                                 "ts": self.get_ts(self.allreduce_gradients_start_events[name]), "dur": allreduce_slot,
                                 "cname": "cq_build_passed", "args": {'para': name, 'size': torch.numel(para.data)}}
                # print(allreduce_log)
                profiling_log.append(allreduce_log)

        optimizer_slot = self.optimizer_step_start_event.elapsed_time(self.optimizer_step_ready_event) * 1e+3
        optimizer_log = {"name": "opt_comp", "ph": "X", "pid": self.global_rank, "tid": "8. optimizer-comp",
                         "ts": self.get_ts(self.optimizer_step_start_event), "dur": optimizer_slot, "cname": "bad"}
        # print(optimizer_log)
        profiling_log.append(optimizer_log)
        return profiling_log
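
The dictionaries appended to `profiling_log` appear to follow the Chrome trace event layout: `"ph": "X"` marks a complete event, and `ts` and `dur` are expressed in microseconds. That is why the millisecond values returned by `torch.cuda.Event.elapsed_time` are multiplied by `1e+3`. Below is a minimal sketch of that conversion without CUDA, assuming plain floats in place of CUDA events. `make_trace_event` is a hypothetical helper used only for illustration and is not part of this module.

```python
import json


def make_trace_event(name, rank, tid, init_ts_us, start_offset_ms, duration_ms, args=None):
    """Build one complete ("X") trace event, converting milliseconds to microseconds."""
    event = {
        "name": name,
        "ph": "X",       # complete event: has both a start time and a duration
        "pid": rank,     # one trace "process" per global rank
        "tid": tid,      # one trace "thread" per logical lane
        "ts": init_ts_us + start_offset_ms * 1e3,
        "dur": duration_ms * 1e3,
    }
    if args is not None:
        event["args"] = args
    return event


if __name__ == "__main__":
    event = make_trace_event("opt_allreduce", 0, "7. optimizer-comm",
                             init_ts_us=1_000_000.0, start_offset_ms=12.0, duration_ms=2.5)
    # A JSON list of such events is the usual input for a trace viewer.
    print(json.dumps([event]))
```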


================================================
FILE: training/data_parallel/dist_dp_sharded_ps.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.cuda
import cupy
from comm.comm_utils import *
from .flatten_utils import flatten_params


class ShardedPSDP:
    def __init__(self, args, device, module: torch.nn.Module, optimizer: torch.optim.Optimizer = None, flatten=True):
        self.flatten = flatten
        self.global_rank = args.rank
        self.dp_group_size = args.data_group_size
        self.enable_tidy_profiling = (args.profiling == 'tidy_profiling')
        self.dp_comm = get_data_parallel_comm()
        self.dp_rank = get_data_parallel_rank()
        self.dp_comm_stream = torch.cuda.Stream(device=device, priority=-1)
        self.torch_optim_comp_stream = torch.cuda.default_stream(device=device)
        self.backward_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.sync_gradients_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)
        self.optimizer_step_ready_event = torch.cuda.Event(enable_timing=self.enable_tidy_profiling, blocking=False)

        self.module = module
        assert optimizer is not None
        self.optimizer = optimizer
        num_paras, element_size = self._compute_total_para_num()
        print("Total number of parameters: {}, element size: {}, total size {} MB."
              .format(num_paras, element_size, num_paras * element_size // 1024 // 1024))

        assert self.flatten
#         self.para = list(self.module.parameters())
        self.flatten_para = flatten_params(self.module.parameters(), self.dp_group_size)
        print("Flattened parameter number: {}, element size: {}."
              .format(self.flatten_para.data.numel(), self.flatten_para.data.element_size()))
        print("Flattened parameter grad number: {}, element size: {}."
              .format(self.flatten_para.grad.numel(), self.flatten_para.grad.element_size()))

        self.grad_buffer = self._declare_grad_buffer()

        if self.enable_tidy_profiling:
            self.global_rank = args.rank
            self.init_event = None
            self.init_time_stamp = None

            assert self.flatten
            self.sync_gradients_start_event = torch.cuda.Event(enable_timing=True, blocking=False)

            self.optimizer_step_start_event = torch.cuda.Event(enable_timing=True, blocking=False)

    def _compute_total_para_num(self):
        total_count = 0
        element_size = 0
        for para in self.module.parameters():
            # print("Parameter: ", para.data.shape)
            total_count += torch.numel(para.data)
            element_size = para.element_size()
        return total_count, element_size

    def _declare_grad_buffer(self):
        assert self.flatten_para.data.numel() % self.dp_group_size == 0
        chunk_size = self.flatten_para.data.numel() // self.dp_group_size
        grad_buffer = [torch.zeros(chunk_size, device=self.flatten_para.device, dtype=self.flatten_para.dtype)
                       for _ in range(self.dp_group_size)]
        return grad_buffer

    def profile_mark_sync_grad_start(self):
        if self.enable_tidy_profiling:
            self.dp_comm_stream.record_event(self.sync_gradients_start_event)

    def profile_mark_allreduce_end(self):
        pass

    def profile_mark_optimizer_step_start(self):
        if self.enable_tidy_profiling:
            self.torch_optim_comp_stream.record_event(self.optimizer_step_start_event)

    def _sync_gradients(self):
        with torch.cuda.stream(self.dp_comm_stream):
            cupy_dp_stream = cupy.cuda.ExternalStream(self.dp_comm_stream.cuda_stream)
            self.dp_comm_stream.wait_event(self.backward_ready_event)
            assert self.flatten
            self.profile_mark_sync_grad_start()
            self.dp_comm.all_reduce_opt(self.flatten_para.grad, self.grad_buffer, stream=cupy_dp_stream)
            self.profile_mark_allreduce_end()
            self.dp_comm_stream.record_event(self.sync_gradients_ready_event)

    def optimizer_step(self):
        self._sync_gradients()
        with torch.cuda.stream(self.torch_optim_comp_stream):
            self.torch_optim_comp_stream.wait_event(self.sync_gradients_ready_event)
            self.profile_mark_optimizer_step_start()
            self.optimizer.step()
            self.torch_optim_comp_stream.record_event(self.optimizer_step_ready_event)

    def set_time_stamp(self, init_time_stamp, init_event):
        self.init_event = init_event
        self.init_time_stamp = init_time_stamp

    def get_ts(self, event):
        return self.init_time_stamp + self.init_event.elapsed_time(event) * 1e+3

    def profiling_data_parallel(self, init_time_stamp, init_event):
        self.set_time_stamp(init_time_stamp, init_event)
        profiling_log = []

        assert self.flatten
        allreduce_slot = self.sync_gradients_start_event.elapsed_time(self.sync_gradients_ready_event)*1e+3
        allreduce_log = {"name": "opt_shardedPS_sync", "ph": "X", "pid": self.global_rank, "tid": "7. optimizer-comm",
                         "ts": self.get_ts(self.sync_gradients_start_event),
                         "dur": allreduce_slot, "cname": "cq_build_passed",
                         "args": {'para': 'flattened_grad', 'size': self.flatten_para.grad.numel()}}
        # print(allreduce_log)
        profiling_log.append(allreduce_log)

        optimizer_slot = self.optimizer_step_start_event.elapsed_time(self.optimizer_step_ready_event) * 1e+3
        optimizer_log = {"name": "opt_comp", "ph": "X", "pid": self.global_rank, "tid": "8. optimizer-comp",
                         "ts": self.get_ts(self.optimizer_step_start_event), "dur": optimizer_slot, "cname": "bad"}
        # print(optimizer_log)
        profiling_log.append(optimizer_log)
        return profiling_log


================================================
FILE: training/data_parallel/dist_dp_utils.py
================================================
from .dist_dp_allreduce import AllReduceDP
from .dist_dp_sharded_ps import ShardedPSDP
from .dist_dp_local import LocalDP


def get_dp_module(args, device, module, optimizer):
    print("Data parallel implementation: ", args.dp_mode)
    if args.dp_mode == 'allreduce':
        # Flattened gradients are not compatible with fp16 yet.
        return AllReduceDP(args, device, module, optimizer, flatten=False)
    elif args.dp_mode == 'local':
        # LocalDP always flattens internally, regardless of this argument.
        return LocalDP(args, device, module, optimizer, flatten=False)
    elif args.dp_mode == 'sharded_ps':
        # ShardedPSDP asserts that flattening is enabled, so it must be requested here.
        return ShardedPSDP(args, device, module, optimizer, flatten=True)
    else:
        raise ValueError("Unrecognized data parallel mode: {}".format(args.dp_mode))


================================================
FILE: training/data_parallel/flatten_utils.py
================================================
import torch


def _assert_contiguous(tensors):
    data_ptr = None
    for t in tensors:
        if data_ptr is not None:
            assert t.data_ptr() == data_ptr
        data_ptr = t.data_ptr() + t.numel() * t.element_size()


def flatten_params(param_set, chunk=None):
    params = [p for p in param_set]
    weights = [p.data for p in params]
    grads = [p.grad.data if p.grad is not None else torch.zeros_like(p.data) for p in params]
    sizes = [p.numel() for p in params]
    total_size = sum(sizes)
    if chunk:
        total_size = ((total_size+chunk-1)//chunk)*chunk

    flatten_weights_tensor = torch.zeros(total_size, dtype=weights[0].dtype).to(weights[0].device)
    flatten_grads_tensor = torch.zeros(total_size, dtype=weights[0].dtype).to(weights[0].device)
    flatten_weights_storage = flatten_weights_tensor.storage()
    flatten_grads_storage = flatten_grads_tensor.storage()

    def set_storage(param, weight_storage, grad_storage, storage_offset):
        with torch.no_grad():
            z = torch.zeros_like(param.data)
            z.set_(weight_storage, storage_offset, param.shape)
            param.data = z

            t = torch.zeros_like(param.data)
            t.set_(grad_storage, storage_offset, param.shape)
            param.grad = t

    offset = 0
    for i in range(len(params)):
        flatten_weights_tensor[offset: offset + weights[i].numel()] = weights[i].reshape(-1)
        flatten_grads_tensor[offset: offset + grads[i].numel()] = grads[i].reshape(-1)
        set_storage(params[i], flatten_weights_storage, flatten_grads_storage, offset)
        offset += sizes[i]

    weight_tensors = [p.data for p in params]
    grad_tensors = [p.grad.data for p in params]

    _assert_contiguous(weight_tensors)
    _assert_contiguous(grad_tensors)

    with torch.no_grad():
        flatten_para = torch.nn.Parameter(flatten_weights_tensor, requires_grad=False)
        flatten_para.grad = flatten_grads_tensor
        return flatten_para
    

def flatten_tensors(tensor_set, chunk=None):
    tensors = [p for p in tensor_set]
    weights = [p.data for p in tensors]
    sizes = [p.numel() for p in tensors]
    total_size = sum(sizes)
    if chunk:
        total_size = ((total_size+chunk-1)//chunk)*chunk

    flatten_weights_tensor = torch.zeros(total_size, dtype=weights[0].dtype).to(weights[0].device)
    flatten_weights_storage = flatten_weights_tensor.storage()

    def set_storage(param, weight_storage, storage_offset):
        with torch.no_grad():
            z = torch.zeros_like(param.data)
            z.set_(weight_storage, storage_offset, param.shape)
            param.data = z

    offset = 0
    for i in range(len(tensors)):
        flatten_weights_tensor[offset: offset + weights[i].numel()] = weights[i].reshape(-1)
        set_storage(tensors[i], flatten_weights_storage, offset)
        offset += sizes[i]

    return flatten_weights_tensor
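
The `chunk` argument of `flatten_params` and `flatten_tensors` rounds the flat buffer up to a multiple of `chunk`. `ShardedPSDP` passes the data-parallel group size here, which lets `_declare_grad_buffer` split the buffer into equal shards. The following is a minimal pure-Python sketch of that arithmetic only. `padded_size` and `shard_size` are hypothetical helper names and are not part of the repository.

```python
def padded_size(total_size: int, chunk: int) -> int:
    """Round total_size up to the nearest multiple of chunk (ceiling division)."""
    return ((total_size + chunk - 1) // chunk) * chunk


def shard_size(total_size: int, dp_group_size: int) -> int:
    """Number of elements each rank holds once the padded buffer is split evenly."""
    padded = padded_size(total_size, dp_group_size)
    # Padding guarantees an even split across the ranks.
    assert padded % dp_group_size == 0
    return padded // dp_group_size


if __name__ == "__main__":
    # 10 elements shared by 4 ranks: the buffer is padded to 12, 3 elements per rank.
    print(padded_size(10, 4), shard_size(10, 4))
```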


================================================
FILE: training/dist_clm_train.py
================================================
import argparse
import sys
import time
import random
import numpy as np
import torch
import torch.autograd.profiler as profiler
from tasks.data_loaders.data_utils import get_train_data_loader, get_eval_data_loader
from modules.utils import gpt_loss_func
from modules.tokenizer import build_tokenizer
from pipeline_parallel.dist_pp_utils import get_pp_module

from transformers import AutoConfig
import datasets

from utils.dist_args_utils import *
from utils.dist_checkpoint_utils import *
from utils.logging_utils import *
from utils.event_report import *
from comm.comm_utils import *

from utils.upload_manager import *


def test_loop(args, pipe, device, test_data_loader):
    
    if test_data_loader is None:
        return
    
    print('testing starts.....')
    
    pipe.model.eval()
    
    if get_pipeline_parallel_rank() == args.pipeline_group_size - 1:
        
        def _lm_pred_func(x, y):
            loss_fct = torch.nn.CrossEntropyLoss(reduction='none')
            logits = x[:, :-1, :].contiguous().float()
            labels = y[:, 1:].contiguous()
            loss = loss_fct(logits.transpose(-1, -2), labels).mean(1).detach().cpu()
            return loss
        
        loss_list = []
        for i, data in enumerate(test_data_loader):
            
            if args.evaluation_num_batch is not None and i >= args.evaluation_num_batch:
                break
                
            input_ids = data['input_ids'].to(device)
            labels = input_ids.clone()
            pipe.infer_iter(input_ids, labels, output_=loss_list, pred_func=_lm_pred_func)
            
        loss = torch.tensor(loss_list).mean()
        ppls = torch.exp(loss)
        metric = {"valid.perplexity": ppls.item(), "valid.loss": loss.item()}
        
        print(metric)
        train_log(
            metric, 
            step=pipe.global_step,
        )
        
    else:
        for i, data in enumerate(test_data_loader):
            
            if args.evaluation_num_batch is not None and i >= args.evaluation_num_batch:
                break
            
            input_ids = data['input_ids'].to(device)
            labels = input_ids.clone()
            current_iter_time = pipe.infer_iter(input_ids, labels)
    
    pipe.model.train()
    


def train_loop(args, pipe, device, train_data_loader, test_data_loader, steps_per_epoch):
    
    print('training starts......')

    event_reporter = EventReporter(host=args.event_host, auth_token=args.event_auth_token, job_id=args.job_id)

    pipe.model.train() # Flag .training to True to enable Dropout
    
    use_dp = (args.world_size != args.pipeline_group_size)
    if use_dp:
        # dp_comm = get_data_parallel_comm()
        dp_rank = get_data_parallel_rank()
        dp_size = get_data_parallel_world_size()
    else:
        dp_rank = 0
        dp_size = 1
    pp_comm = get_pipeline_parallel_comm()
    
    stop_flag = torch.zeros(1, dtype=torch.int64).to(device)
    
    input_ids = torch.zeros(
        [args.batch_size, args.seq_length], 
        dtype=torch.int64
    ).to(device)
    
    do_sync_before_save = (args.dp_mode in ['local'] and use_dp)

    # Count the model parameters: each pipeline stage counts its local
    # parameters and the counts are reduced onto pipeline rank 0.
    param_count = torch.zeros(1, dtype=torch.int64).to(device)
    local_param_count = sum(p.numel() for p in pipe.model.parameters())
    param_count.data[:] = local_param_count
    pp_comm.reduce(param_count, 0)

    if get_pipeline_parallel_rank() == 0 and dp_rank == 0:

        print(f"Training steps:  total_steps={args.total_steps},  steps_per_epoch={steps_per_epoch},  steps_per_checkpoint={args.checkpoint_steps}")

        upload_checkpoints_enabled = args.checkpoint_upload_prefix is not None 
        upload_manager = UploadManager(aws_endpoint_url = args.aws_endpoint_url,
                                       aws_access_key_id = args.aws_access_key_id,
                                       aws_secret_access_key = args.aws_secret_access_key,
                                       aws_session_token = args.aws_session_token,
                                       aws_region = args.aws_region,
                                       event_reporter = event_reporter,
                                       n_stages = args.pipeline_group_size)

        if event_reporter is not None:

            # Get the number of tokens in the dataset
            token_count = train_data_loader.dataset.get_dataset_token_count()

            # Report training start
            event_reporter.report(object=EventReporter.OBJECT_FINE_TUNE,
                                  message=f"Training started for model {args.model_name}",
                                  event_type=EventReporter.EVENT_TYPE_TRAINING_START,
                                  param_count=param_count.item(),
                                  token_count=token_count,
                                  requires_is_enabled=False)
        
        for i, data in enumerate(train_data_loader):
            # if i < pipe.global_step:
            #     continue
                
            if use_dp:
                get_data_parallel_comm().broadcast(stop_flag, 0)
            pp_comm.broadcast(stop_flag, 0)
            
            if stop_flag.item() == 1:
                break
            
            input_ids_global = data['input_ids'].to(torch.int64).to(device)
            
            input_ids_list = input_ids_global.chunk(dp_size)
            
            if use_dp:
                for j in range(1, dp_size):
                    get_data_parallel_comm().send(
                        input_ids_list[j], j,
                    )
                
            input_ids = input_ids_list[0]
            
            pp_comm.broadcast(input_ids, 0)
            
            labels = input_ids.clone()
            current_iter_time = pipe.sgd_iter(input_ids, labels, loss_func=gpt_loss_func)

            if event_reporter is not None and (pipe.global_step >= args.total_steps or pipe.global_step % steps_per_epoch == 0):
                event_reporter.report(object=EventReporter.OBJECT_FINE_TUNE,
                                      message=f"Epoch completed, at step {pipe.global_step}",
                                      event_type=EventReporter.EVENT_TYPE_EPOCH_COMPLETE,
                                      requires_is_enabled=False)
            
            if args.evaluation_steps > 0 and pipe.global_step % args.evaluation_steps == 0:
                test_loop(args, pipe, device, test_data_loader)
            
            if pipe.global_step >= args.total_steps or pipe.global_step % args.checkpoint_steps == 0:
                if do_sync_before_save:
                    pipe.dp_optim.allreduce_parameters()
                if dp_rank == 0:
                    checkpoint_step_path = save_checkpoint(pipe, args)
                    if upload_checkpoints_enabled:
                        upload_manager.add_task(directory=checkpoint_step_path,
                                                checkpoint_upload_prefix=args.checkpoint_upload_prefix,
                                                step=pipe.global_step)

                if do_sync_before_save:
                    pipe.dp_optim.rollback_parameters()
            
            if pipe.global_step >= args.total_steps:
                stop_flag.data[:] = 1
        
        if upload_checkpoints_enabled:
            upload_manager.wait()
            
    elif get_pipeline_parallel_rank() == 0:
        
        while True:
            
            get_data_parallel_comm().broadcast(stop_flag, 0)
            pp_comm.broadcast(stop_flag, 0)
            if stop_flag.item() == 1:
                break
                
            get_data_parallel_comm().recv(
                input_ids, 0,
            )
            pp_comm.broadcast(input_ids, 0)
            
            labels = input_ids.clone()
            current_iter_time = pipe.sgd_iter(input_ids, labels, loss_func=gpt_loss_func)
            
            if args.evaluation_steps > 0 and pipe.global_step % args.evaluation_steps == 0:
                test_loop(args, pipe, device, test_data_loader)
                
            if pipe.global_step >= args.total_steps or pipe.global_step % args.checkpoint_steps == 0:
                if do_sync_before_save:
                    pipe.dp_optim.allreduce_parameters()
                if dp_rank == 0:
                    save_checkpoint(pipe, args)
                if do_sync_before_save:
                    pipe.dp_optim.rollback_parameters()
            
            
    elif get_pipeline_parallel_rank() == args.pipeline_group_size - 1:
        
        while True:
            
            pp_comm.broadcast(stop_flag, 0)
            if stop_flag.item() == 1:
                break
                
            pp_comm.broadcast(input_ids, 0)
            labels = input_ids.clone()
            current_iter_time = pipe.sgd_iter(input_ids, labels, loss_func=gpt_loss_func) # lm loss func
            
            if args.evaluation_steps > 0 and pipe.global_step % args.evaluation_steps == 0:
                test_loop(args, pipe, device, test_data_loader)
                
            if pipe.global_step >= args.total_steps or pipe.global_step % args.checkpoint_steps == 0:
                if do_sync_before_save:
                    pipe.dp_optim.allreduce_parameters()
                if dp_rank == 0:
                    save_checkpoint(pipe, args)
                    pipe.save_on_disk(args.checkpoint_path)
                if do_sync_before_save:
                    pipe.dp_optim.rollback_parameters()
    else:
        while True:
            pp_comm.broadcast(stop_flag, 0)
            if stop_flag.item() == 1:
                break
            pp_comm.broadcast(input_ids, 0)
            current_iter_time = pipe.sgd_iter(None, None)
            
            if args.evaluation_steps > 0 and pipe.global_step % args.evaluation_steps == 0:
                test_loop(args, pipe, device, test_data_loader)
                
            if pipe.global_step >= args.total_steps or pipe.global_step % args.checkpoint_steps == 0:
                if do_sync_before_save:
                    pipe.dp_optim.allreduce_parameters()
                if dp_rank == 0:
                    save_checkpoint(pipe, args)
                if do_sync_before_save:
                    pipe.dp_optim.rollback_parameters()

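# Compute the total number of training steps, steps per epoch, and steps per
# checkpoint. The total and per-checkpoint step counts are written back to
# `args`; the number of steps per epoch is returned.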
def calculate_training_steps(args, train_data_loader) -> int:
    total_steps = 0
    steps_per_epoch = 0
    steps_per_checkpoint = 0

    token_count = train_data_loader.dataset.get_dataset_token_count()

    # Check the inputs to calculate the total steps
    if args.batch_size is None or args.world_size is None or args.pipeline_group_size is None or token_count is None or args.seq_length is None:
        print("Missing required arguments for calculating total steps based on epochs.")
        sys.exit(1)

    global_batch_size = (args.batch_size * args.world_size + args.pipeline_group_size - 1) // args.pipeline_group_size
    tokens_per_batch = global_batch_size * args.seq_length
    steps_per_epoch = (token_count + tokens_per_batch - 1) // tokens_per_batch

    if args.total_steps is not None:
        if args.nepochs is not None:
            print(f"WARNING: total_steps ({args.total_steps}) supersedes nepochs ({args.nepochs}).")
        total_steps = args.total_steps
    elif args.nepochs is not None:
        total_steps = steps_per_epoch * args.nepochs
    else:
        total_steps = len(train_data_loader)

    # Set the minimum number of total steps
    if total_steps < 10:
        total_steps = 10

    # Ensure that the steps per epoch are consistent with total steps
    # Note: This does not strictly follow the definition of an epoch. It just
    # approximately distributes the reporting of epochs over the total number of
    # steps.
    if args.nepochs is not None:
        steps_per_epoch = (total_steps + args.nepochs - 1) // args.nepochs

    # clamp steps_per_epoch to [1, total_steps]
    if steps_per_epoch > total_steps:
        steps_per_epoch = total_steps
    if steps_per_epoch < 1:
        steps_per_epoch = 1

    # Set the number of steps per checkpoint based on user input.
    if args.checkpoint_steps is not None and args.checkpoint_steps > 0:
        steps_per_checkpoint = args.checkpoint_steps
    elif args.num_checkpoints is not None and args.num_checkpoints > 0:
        steps_per_checkpoint = (total_steps + args.num_checkpoints - 1) // args.num_checkpoints
    else:
        steps_per_checkpoint = total_steps
    
    # Clamp steps_per_checkpoint to [1, total_steps]
    if steps_per_checkpoint > total_steps:
        steps_per_checkpoint = total_steps
    if steps_per_checkpoint < 1:
        steps_per_checkpoint = 1

    # Set the args based on what we computed above
    args.total_steps = total_steps
    args.checkpoint_steps = steps_per_checkpoint
    return steps_per_epoch

def main():
    parser = argparse.ArgumentParser(description='Gpipe-GPT')
    add_device_arguments(parser)
    add_torch_distributed_arguments(parser)
    add_model_arguments(parser)
    add_task_arguments(parser)
    add_training_hyper_parameter_arguments(parser)
    add_mixed_precision_arguments(parser)
    add_parallel_schema_arguments(parser)
    add_entry_reporter_arguments(parser)
    parser.add_argument('--model-name', type=str, default='gpt2', metavar='S',
                        help='model name or path')
    parser.add_argument('--tokenizer-name', type=str, default='gpt2', metavar='S',
                        help='tokenizer name or path')
    parser.add_argument('--model-type', type=str, default='gpt2', metavar='S',
                        help='model type')
    parser.add_argument('--checkpoint-path', type=str, default='model_checkpoints/gpt2')
    parser.add_argument('--task-name', type=str, default='cot', metavar='S',
                        help='task name')
    parser.add_argument('--warmup-steps', type=int, default=0, help='-')
    parser.add_argument('--train-warmup-steps', type=int, default=0, help='-')
    parser.add_argument('--nepochs', type=int, default=None, help='-')
    parser.add_argument('--total-steps', type=int, default=None, help='-')
    parser.add_argument('--load-pretrained-model', 
                        type=lambda x: x.lower()=='true', default=True, metavar='S',
                        help='load pretrained model or not.')
    parser.add_argument('--load-checkpoint', 
                        type=lambda x: x.lower()=='true', default=True, metavar='S',
                        help='load checkpoint or not.')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    parser.add_argument('--profiling', type=str, default='no-profiling', metavar='S',
                        help='enable which profiling? default: no-profiling')
    parser.add_argument('--trace-postfix', type=str, default='default', metavar='S',
                        help='postfix of the tracing file name.')
    parser.add_argument('--evaluation-steps', 
                        type=int, default=0, metavar='S',
                        help='every x steps, do evaluation. (0 means do not do evaluation)')
    parser.add_argument('--evaluation-data',
                        type=str, default=None, help="path of eval data in jsonl")
    parser.add_argument('--evaluation-num-batch',
                        type=int, default=None, help="for debug purpose, only eval the first several batch.")
    parser.add_argument('--checkpoint-steps', 
                        type=int, default=0, metavar='S',
                        help='every x steps, save checkpoint. (0 means do not save checkpoint)')
    parser.add_argument('--num-checkpoints', 
                        type=int, default=0, metavar='S',
                        help='number of checkpoints to save')
    parser.add_argument('--net-interface', 
                        type=str, default='lo', metavar='S',
                        help='net_interface')
    parser.add_argument('--job-id', 
                        type=str, default="0", metavar='S',
                        help='a UUID')
    
    # Add AWS arguments for uploading checkpoints to S3
    parser.add_argument('--checkpoint-upload-prefix', default=None, help='S3 bucket name')
    add_aws_arguments(parser)

    args = parser.parse_args()
    aws_process_args(args)
    
    torch.manual_seed(args.seed)
    random.seed(args.seed)
    np.random.seed(args.seed)
    
    if args.use_cuda:
        assert (torch.cuda.is_available())
        device = torch.device('cuda', args.cuda_id)
    else:
        device = torch.device('cpu')
        
    init_communicators(args)
    
    use_dp = (args.world_size != args.pipeline_group_size)
    if use_dp:
        dp_comm = get_data_parallel_comm()
        dp_rank = get_data_parallel_rank()
        dp_size = get_data_parallel_world_size()
    else:
        dp_rank = 0
        dp_size = 1
    
    config = AutoConfig.from_pretrained(args.model_name)
    
    # num layer globally
    if hasattr(config, 'num_hidden_layers'):
        args.max_layers = config.num_hidden_layers
    elif hasattr(config, 'num_layers'):
        args.max_layers = config.num_layers 
    else:
        args.max_layers = config.n_layer
    
    tokenizer = build_tokenizer(args)
    tokenizer.model_max_length = args.seq_length
    config.max_position_embeddings = args.seq_length
    # config.vocab_size = tokenizer.vocab_size
    config.bos_token_id = tokenizer.bos_token_id
    config.eos_token_id = tokenizer.eos_token_id
    config.pad_token_id = tokenizer.pad_token_id
    print("token vocab size:", config.vocab_size)
    
    train_data_loader = get_train_data_loader(args, tokenizer)
        
    if args.evaluation_data is not None and dp_rank == 0:
        test_data_loader = get_eval_data_loader(args, tokenizer)
    else:
        test_data_loader = None
    
    # calculate total steps
    steps_per_epoch = calculate_training_steps(args, train_data_loader)
    
    use_dp = (args.world_size != args.pipeline_group_size)
    if use_dp:
        print("Running ", args.pp_mode, " with data parallel.")
    else:
        print("Running ", args.pp_mode, " without data parallel.")
    
    pipe = get_pp_module(args, config, device, use_dp)
    
    if args.load_checkpoint:
        load_checkpoint(pipe, args)

    if args.fp16:
        pipe.optimizer.reload_model_params()

    if args.profiling == 'no-profiling':
        train_loop(args, pipe, device, train_data_loader, test_data_loader, steps_per_epoch)
    else:
        prefix = './trace_json/gpt3_' + args.pp_mode
        if use_dp:
            prefix = prefix + '_' + args.dp_mode
        trace_file = prefix + get_learning_arguments_str(args) + get_model_arguments_str(args) + \
                     get_dist_arguments_str(args) + get_mixed_precision_arguments_str(args) + '_' + \
                     args.profiling + '_' + args.trace_postfix + '.json'
        if args.profiling == 'tidy_profiling':
            try:
                train_loop(args, pipe, device, train_data_loader, test_data_loader, steps_per_epoch)
            except Exception as e:
                # Report which pipeline rank failed, then re-raise.
                print(get_pipeline_parallel_rank(), e)
                raise
            pipe.export_profiling_result(filename=trace_file)
        elif args.profiling == 'pytorch_profiling':
            with profiler.profile(profile_memory=True, use_cuda=args.use_cuda) as prof:
                train_loop(args, pipe, device, train_data_loader, test_data_loader, steps_per_epoch)
            print(prof.key_averages().table())
            prof.export_chrome_trace(trace_file)
        else:
            print("No recognized profiler?")
            assert False
    print(get_pipeline_parallel_rank(), 'finished.')

if __name__ == '__main__':
    main()
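
The step arithmetic in `calculate_training_steps` above can be checked in isolation. The following is a minimal standalone sketch, assuming the epoch-based path (`nepochs` given, `total_steps` not given); `ceil_div` and `plan_steps` are illustrative names that do not exist in the repository, and the minimum of 10 steps is taken from the function above.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


def plan_steps(token_count, batch_size, world_size, pipeline_group_size,
               seq_length, nepochs, num_checkpoints=0, min_steps=10):
    """Return (total_steps, steps_per_epoch, steps_per_checkpoint).

    Hypothetical helper mirroring the epoch-based branch only.
    """
    # Sequences per step, summed over the data-parallel replicas.
    global_batch_size = ceil_div(batch_size * world_size, pipeline_group_size)
    tokens_per_batch = global_batch_size * seq_length
    steps_per_epoch = ceil_div(token_count, tokens_per_batch)

    # Enforce the minimum number of total steps.
    total_steps = max(steps_per_epoch * nepochs, min_steps)

    # Re-spread the epochs over the (possibly raised) total, then clamp.
    steps_per_epoch = ceil_div(total_steps, nepochs)
    steps_per_epoch = min(max(steps_per_epoch, 1), total_steps)

    # A single checkpoint at the end unless a checkpoint count is requested.
    if num_checkpoints > 0:
        steps_per_checkpoint = ceil_div(total_steps, num_checkpoints)
    else:
        steps_per_checkpoint = total_steps
    steps_per_checkpoint = min(max(steps_per_checkpoint, 1), total_steps)

    return total_steps, steps_per_epoch, steps_per_checkpoint


# 1M tokens, batch 32, 8 ranks in pipeline groups of 4, sequence length 2048.
print(plan_steps(1_000_000, 32, 8, 4, 2048, nepochs=3, num_checkpoints=5))
# -> (24, 8, 5)
```

With these numbers the global batch is 64 sequences (131072 tokens), so one epoch takes 8 steps and three epochs take 24. A very small dataset is raised to the 10-step minimum, which is why the per-epoch step count is recomputed afterwards.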

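
The `--load-pretrained-model` and `--load-checkpoint` flags above use `type=lambda x: x.lower()=='true'`, which silently maps any other string (for example a typo such as `ture`) to `False`. A stricter converter is one possible alternative. This is a sketch only: `str2bool` is a hypothetical helper, not part of the repository, and the set of accepted spellings is an assumption.

```python
import argparse


def str2bool(value: str) -> bool:
    """Parse a command-line boolean, rejecting unrecognized spellings."""
    lowered = value.strip().lower()
    if lowered in ("true", "1", "yes"):
        return True
    if lowered in ("false", "0", "no"):
        return False
    # argparse turns this into a usage error instead of a silent False.
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


parser = argparse.ArgumentParser()
parser.add_argument("--load-checkpoint", type=str2bool, default=True,
                    help="load checkpoint or not.")

# Parse an explicit argument list so the sketch runs without a real CLI.
args = parser.parse_args(["--load-checkpoint", "false"])
print(args.load_checkpoint)
```

Raising `argparse.ArgumentTypeError` makes `parse_args` exit with a usage message on invalid input, so a mistyped flag value is caught before training starts.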

================================================
FILE: training/dist_prefixlm_train.py
================================================
import argparse
import time
import random
import numpy as np
import torch
import torch.autograd.profiler as profiler
from tasks.data_loaders.data_utils import get_ul2r_train_data_loader
from modules.utils import gpt_loss_func
from modules.tokenizer import build_tokenizer
from pipeline_parallel.dist_pp_utils import get_pp_module

from transformers import AutoConfig
import datasets

from utils.dist_args_utils import *
from utils.dist_checkpoint_utils import *
from utils.logging_utils import *
from comm.comm_utils import *


def test_loop(args, pipe, device, test_data_loader):
    print("no impl for testing, skip.")


def train_loop(args, pipe, device, train_data_loader, test_data_loader):
    
    print('training starts......')

    pipe.model.train() # Flag .training to True to enable Dropout
    
    use_dp = (args.world_size != args.pipeline_group_size)
    if use_dp:
        # dp_comm = get_data_parallel_comm()
        dp_rank = get_data_parallel_rank()
        dp_size = get_data_parallel_world_size()
    else:
        dp_rank = 0
        dp_size = 1
    pp_comm = get_pipeline_parallel_comm()
    
    stop_flag = torch.zeros(1, dtype=torch.int64).to(device)
    
    input_ids = torch.zeros(
        [args.batch_size, args.seq_length], 
        dtype=torch.int64
    ).to(device)
    
    prefix_masks = torch.zeros(
        [args.batch_size, args.seq_length], 
        dtype=torch.uint8
    ).to(device)
    
    do_sync_before_save = (args.dp_mode in ['local'] and use_dp)
    
    if get_pipeline_parallel_rank() == 0 and dp_rank == 0:
        
        for i, data in enumerate(train_data_loader):
            if i < pipe.global_step:
                continue
                
            if use_dp:
                get_data_parallel_comm().broadcast(stop_flag, 0)
            pp_comm.broadcast(stop_flag, 0)
            
            if stop_flag.item() == 1:
                break
            
            input_ids_global = data['input_ids'].to(torch.int64).to(device)
            prefix_masks_global = data['prefix_masks'].to(torch.uint8).to(device)
            
            input_ids_list = input_ids_global.chunk(dp_size)
            prefix_masks_list = prefix_masks_global.chunk(dp_size)
            
            if use_dp:
                for j in range(1, dp_size):
                    get_data_parallel_comm().send(
                        input_ids_list[j], j,
                    )
                    get_data_parallel_comm().send(
                        prefix_masks_list[j], j,
                    )
                
            input_ids = input_ids_list[0]
            prefix_masks = prefix_masks_list[0]
            
            pp_comm.broadcast(input_ids, 0)
            pp_comm.broadcast(prefix_masks, 0)
            
            labels = input_ids.clone()
            current_iter_time = pipe.sgd_iter(
                input_ids, labels, aux_input_data={'prefix_masks': prefix_masks}, loss_func=gpt_loss_func
            )
            
            if args.evaluation_steps > 0 and pipe.global_step % args.evaluation_steps == 0:
                test_loop(args, pipe, device, test_data_loader)
            
            if pipe.global_step % args.checkpoint_steps == 0:
                if do_sync_before_save:
                    pipe.dp_optim.allreduce_parameters()
                if dp_rank == 0:
                    save_checkpoint(pipe, args)
                if do_sync_before_save:
                    pipe.dp_optim.rollback_parameters()
            
            if pipe.global_step >= args.total_steps:
                stop_flag.data[:] = 1
            
    elif get_pipeline_parallel_rank() == 0:
        
        while True:
            
            get_data_parallel_comm().broadcast(stop_flag, 0)
            pp_comm.broadcast(stop_flag, 0)
            if stop_flag.item() == 1:
                break
                
            get_data_parallel_c

SYMBOL INDEX (475 symbols across 44 files)

FILE: data/prepare_data.py
  function is_git_lfs_installed (line 17) | def is_git_lfs_installed():
  function is_huggingface_git_url (line 27) | def is_huggingface_git_url(url):
  function is_github_repo_url (line 36) | def is_github_repo_url(url):
  function is_s3_url (line 47) | def is_s3_url(url):
  function clone_git_repo (line 77) | def clone_git_repo(data_source, destination_dir):
  function download_from_s3 (line 111) | def download_from_s3(url, destination_dir, access_key_id = None,
  function download_from_url (line 185) | def download_from_url(url, destination_dir):
  function prepare_data (line 218) | def prepare_data(data_source, destination_dir, access_key_id=None, secre...
  function main (line 253) | def main():

FILE: inference/bot.py
  class StopWordsCriteria (line 18) | class StopWordsCriteria(StoppingCriteria):
    method __init__ (line 19) | def __init__(self, tokenizer, stop_words, stream_callback):
    method __call__ (line 26) | def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTen...
  class ChatModel (line 47) | class ChatModel:
    method __init__ (line 51) | def __init__(self, model_name, gpu_id, max_memory):
    method do_inference (line 86) | def do_inference(self, prompt, max_new_tokens, do_sample, temperature,...
  class OpenChatKitShell (line 109) | class OpenChatKitShell(cmd.Cmd):
    method __init__ (line 113) | def __init__(self, gpu_id, model_name_or_path, max_tokens, sample, tem...
    method preloop (line 125) | def preloop(self):
    method precmd (line 136) | def precmd(self, line):
    method do_say (line 142) | def do_say(self, arg):
    method do_raw_say (line 163) | def do_raw_say(self, arg):
    method do_raw_prompt (line 174) | def do_raw_prompt(self, arg):
    method do_reset (line 177) | def do_reset(self, arg):
    method do_hyperparameters (line 181) | def do_hyperparameters(self, arg):
    method do_quit (line 190) | def do_quit(self, arg):
  function main (line 194) | def main():

FILE: inference/conversation.py
  function clean_response (line 11) | def clean_response(response):
  class Conversation (line 17) | class Conversation:
    method __init__ (line 18) | def __init__(self, human_id, bot_id):
    method push_context_turn (line 26) | def push_context_turn(self, context):
    method push_human_turn (line 30) | def push_human_turn(self, query):
    method push_model_response (line 34) | def push_model_response(self, response):
    method get_last_turn (line 44) | def get_last_turn(self):
    method get_raw_prompt (line 50) | def get_raw_prompt(self):
    method from_raw_prompt (line 54) | def from_raw_prompt(cls, value):

FILE: pretrained/prepare_pretrained.py
  function prepare_pretrained (line 10) | def prepare_pretrained(save_path, model_name, offload_dir=None):
  function main (line 53) | def main():

FILE: retrieval/wikipedia.py
  function mean_pooling (line 16) | def mean_pooling(token_embeddings, mask):
  function cos_sim_2d (line 21) | def cos_sim_2d(x, y):
  class WikipediaIndex (line 27) | class WikipediaIndex:
    method __init__ (line 28) | def __init__(self):
    method search (line 42) | def search(self, query, k=1, w=5, w_th=0.5):

FILE: tools/convert_to_hf_gptneox.py
  function create_empty_gptneox (line 14) | def create_empty_gptneox(config):
  function load_decentralized_checkpoint (line 33) | def load_decentralized_checkpoint(model, checkpoint_path, n_stages=2, n_...

FILE: tools/convert_to_hf_llama.py
  function create_emtpy_llama (line 15) | def create_emtpy_llama(config):
  function load_decentralized_checkpoint (line 34) | def load_decentralized_checkpoint(model, checkpoint_path, n_stages=2, n_...

FILE: tools/model_load_benchmark.py
  function benchmark (line 12) | def benchmark(model_dict: dict, device_name: str, repeat_infer: int):
  function main (line 132) | def main(input_file: str, output_file: str, device_name: str, repeat_inf...

FILE: training/comm/comm_utils.py
  function get_lock (line 20) | def get_lock():
  function get_data_parallel_comm (line 23) | def get_data_parallel_comm() -> NCCLCommunicator:
  function get_data_parallel_rank (line 28) | def get_data_parallel_rank() -> int:
  function get_data_parallel_world_size (line 33) | def get_data_parallel_world_size() -> int:
  function get_pipeline_parallel_comm (line 38) | def get_pipeline_parallel_comm() -> NCCLCommunicator:
  function get_pipeline_parallel_rank (line 43) | def get_pipeline_parallel_rank() -> int:
  function get_pipeline_parallel_world_size (line 48) | def get_pipeline_parallel_world_size() -> int:
  function get_megatron_tensor_parallel_comm (line 53) | def get_megatron_tensor_parallel_comm() -> NCCLCommunicator:
  function get_megatron_tensor_parallel_rank (line 58) | def get_megatron_tensor_parallel_rank() -> int:
  function get_megatron_tensor_parallel_world_size (line 63) | def get_megatron_tensor_parallel_world_size() -> int:
  function default_init (line 68) | def default_init(args):
  function init_communicators (line 84) | def init_communicators(args):
  function reinit_dp_communicator (line 159) | def reinit_dp_communicator(args):

FILE: training/comm/nccl_backend.py
  function _type_torch_to_cupy (line 9) | def _type_torch_to_cupy(torch_type: torch.dtype):
  class NCCLCommunicator (line 24) | class NCCLCommunicator:
    method __init__ (line 25) | def __init__(self,
    method barrier (line 51) | def barrier():
    method store_set (line 54) | def store_set(self, key, value):
    method store_get (line 57) | def store_get(self, key):
    method send (line 60) | def send(self,
    method recv (line 73) | def recv(self,
    method broadcast (line 87) | def broadcast(self,
    method reduce (line 99) | def reduce(self,
    method all_reduce (line 114) | def all_reduce(self,
    method scatter (line 127) | def scatter(self,
    method gather (line 147) | def gather(self,
    method all_to_all (line 167) | def all_to_all(self,
    method all_gather (line 178) | def all_gather(self,
    method all_reduce_opt (line 190) | def all_reduce_opt(self,

FILE: training/comm/torch_backend.py
  class TorchCommunicator (line 5) | class TorchCommunicator:
    method __init__ (line 7) | def __init__(self,
    method barrier (line 18) | def barrier(self):
    method send (line 21) | def send(self,
    method recv (line 31) | def recv(self,
    method isend (line 43) | def isend(self,
    method irecv (line 54) | def irecv(self,
    method broadcast (line 67) | def broadcast(self,
    method reduce (line 78) | def reduce(self,
    method all_reduce (line 85) | def all_reduce(self,
    method gather (line 93) | def gather(self,
    method all_to_all (line 100) | def all_to_all(self,
    method all_gather (line 106) | def all_gather(self,

FILE: training/data_parallel/dist_dp_allreduce.py
  class AllReduceDP (line 6) | class AllReduceDP:
    method __init__ (line 7) | def __init__(self, args, device, module: torch.nn.Module, optimizer: t...
    method _compute_total_para_num (line 51) | def _compute_total_para_num(self):
    method profile_mark_allreduce_start (line 60) | def profile_mark_allreduce_start(self, name=None):
    method profile_mark_allreduce_end (line 67) | def profile_mark_allreduce_end(self, name=None):
    method profile_mark_optimizer_step_start (line 72) | def profile_mark_optimizer_step_start(self):
    method _allreduce_gradients (line 76) | def _allreduce_gradients(self):
    method optimizer_step (line 93) | def optimizer_step(self):
    method set_time_stamp (line 101) | def set_time_stamp(self, init_time_stamp, init_event):
    method get_ts (line 105) | def get_ts(self, event):
    method profiling_data_parallel (line 108) | def profiling_data_parallel(self, init_time_stamp, init_event):

FILE: training/data_parallel/dist_dp_central_ps.py
  class CentralPSDP (line 6) | class CentralPSDP:
    method __init__ (line 7) | def __init__(self, args, device, module: torch.nn.Module, optimizer: t...
    method _compute_total_para_num (line 58) | def _compute_total_para_num(self):
    method profile_mark_reduce_start (line 67) | def profile_mark_reduce_start(self, name=None):
    method profile_mark_reduce_end (line 74) | def profile_mark_reduce_end(self, name=None):
    method profile_mark_optimizer_step_start (line 81) | def profile_mark_optimizer_step_start(self):
    method profile_mark_broadcast_start (line 85) | def profile_mark_broadcast_start(self, name=None):
    method profile_mark_broadcast_end (line 92) | def profile_mark_broadcast_end(self, name=None):
    method _reduce_gradients (line 97) | def _reduce_gradients(self):
    method _broadcast_reduced_gradients (line 111) | def _broadcast_reduced_gradients(self):
    method optimizer_step (line 125) | def optimizer_step(self):
    method set_time_stamp (line 134) | def set_time_stamp(self, init_time_stamp, init_event):
    method get_ts (line 138) | def get_ts(self, event):
    method profiling_data_parallel (line 141) | def profiling_data_parallel(self, init_time_stamp, init_event):

FILE: training/data_parallel/dist_dp_local.py
  class LocalDP (line 7) | class LocalDP:
    method __init__ (line 8) | def __init__(self, args, device, module: torch.nn.Module, optimizer: t...
    method _compute_total_para_num (line 54) | def _compute_total_para_num(self):
    method profile_mark_allreduce_start (line 63) | def profile_mark_allreduce_start(self, name=None):
    method profile_mark_allreduce_end (line 70) | def profile_mark_allreduce_end(self, name=None):
    method profile_mark_optimizer_step_start (line 75) | def profile_mark_optimizer_step_start(self):
    method allreduce_parameters (line 79) | def allreduce_parameters(self):
    method rollback_parameters (line 103) | def rollback_parameters(self):
    method optimizer_step (line 113) | def optimizer_step(self):
    method set_time_stamp (line 123) | def set_time_stamp(self, init_time_stamp, init_event):
    method get_ts (line 127) | def get_ts(self, event):
    method profiling_data_parallel (line 130) | def profiling_data_parallel(self, init_time_stamp, init_event):

FILE: training/data_parallel/dist_dp_sharded_ps.py
  class ShardedPSDP (line 9) | class ShardedPSDP:
    method __init__ (line 10) | def __init__(self, args, device, module: torch.nn.Module, optimizer: t...
    method _compute_total_para_num (line 50) | def _compute_total_para_num(self):
    method _declare_grad_buffer (line 59) | def _declare_grad_buffer(self):
    method profile_mark_sync_grad_start (line 66) | def profile_mark_sync_grad_start(self):
    method profile_mark_allreduce_end (line 70) | def profile_mark_allreduce_end(self):
    method profile_mark_optimizer_step_start (line 73) | def profile_mark_optimizer_step_start(self):
    method _sync_gradients (line 77) | def _sync_gradients(self):
    method optimizer_step (line 87) | def optimizer_step(self):
    method set_time_stamp (line 95) | def set_time_stamp(self, init_time_stamp, init_event):
    method get_ts (line 99) | def get_ts(self, event):
    method profiling_data_parallel (line 102) | def profiling_data_parallel(self, init_time_stamp, init_event):

FILE: training/data_parallel/dist_dp_utils.py
  function get_dp_module (line 6) | def get_dp_module(args, device, module, optimizer):

FILE: training/data_parallel/flatten_utils.py
  function _assert_contiguous (line 4) | def _assert_contiguous(tensors):
  function flatten_params (line 12) | def flatten_params(param_set, chunk=None):
  function flatten_tensors (line 55) | def flatten_tensors(tensor_set, chunk=None):

FILE: training/dist_clm_train.py
  function test_loop (line 24) | def test_loop(args, pipe, device, test_data_loader):
  function train_loop (line 76) | def train_loop(args, pipe, device, train_data_loader, test_data_loader, ...
  function calculate_training_steps (line 264) | def calculate_training_steps(args, train_data_loader) -> int:
  function main (line 325) | def main():

FILE: training/dist_prefixlm_train.py
  function test_loop (line 21) | def test_loop(args, pipe, device, test_data_loader):
  function train_loop (line 25) | def train_loop(args, pipe, device, train_data_loader, test_data_loader):
  function main (line 190) | def main():

FILE: training/lora/example/redpajama-incite-chat-3b.py
  function print_trainable_parameters (line 37) | def print_trainable_parameters(model):

FILE: training/modules/deberta_modules.py
  function make_log_bucket_position (line 15) | def make_log_bucket_position(relative_pos, bucket_size, max_position):
  function build_relative_position (line 23) | def build_relative_position(query_size, key_size, bucket_size=-1, max_po...
  class DisentangledSelfAttention (line 35) | class DisentangledSelfAttention(nn.Module):
    method __init__ (line 37) | def __init__(self, config):
    method transpose_for_scores (line 75) | def transpose_for_scores(self, x, attention_heads):
    method forward (line 80) | def forward(
    method disentangled_attention_bias (line 135) | def disentangled_attention_bias(self, query_layer, key_layer, relative...
  class DebertaV2Layers (line 222) | class DebertaV2Layers(_DebertaV2Encoder):
    method __init__ (line 223) | def __init__(self, config, first_block=False):
    method get_rel_pos (line 261) | def get_rel_pos(self, hidden_states, query_states=None, relative_pos=N...
    method forward (line 269) | def forward(
  class DebertaClassificationHead (line 322) | class DebertaClassificationHead(nn.Module):
    method __init__ (line 323) | def __init__(self, config):
    method forward (line 335) | def forward(self, hidden_states, input_ids=None):

FILE: training/modules/dist_deberta_pp_module.py
  class DebertaStageBase (line 5) | class DebertaStageBase(nn.Module):
    method __init__ (line 6) | def __init__(self, args, config):
    method _create_first_layer (line 11) | def _create_first_layer(self):
    method _create_last_layer (line 14) | def _create_last_layer(self):
    method _create_transformer_layers (line 17) | def _create_transformer_layers(self, first_block=False):
  class DebertaStageFirst (line 21) | class DebertaStageFirst(DebertaStageBase):
    method __init__ (line 22) | def __init__(self, args, config, device):
    method forward (line 28) | def forward(self, x, token_type_ids=None, attention_mask=None):
  class DebertaStageMiddle (line 40) | class DebertaStageMiddle(DebertaStageBase):
    method __init__ (line 41) | def __init__(self, args, config, device):
    method forward (line 46) | def forward(self, x, attention_mask=None):
  class DebertaStageLast (line 55) | class DebertaStageLast(DebertaStageBase):
    method __init__ (line 56) | def __init__(self, args, config, device):
    method forward (line 62) | def forward(self, x, attention_mask=None, input_ids=None):

FILE: training/modules/dist_gpt_fsdp_module.py
  class GPTTransformerFsdpLayer (line 10) | class GPTTransformerFsdpLayer(torch.nn.Module):
    method __init__ (line 11) | def __init__(self, model_dim, head_num, feedforward_dim=2048, layer_no...
    method forward (line 32) | def forward(self, x: torch.Tensor) -> torch.Tensor:
  class GPTGlueFsdpModel (line 44) | class GPTGlueFsdpModel(torch.nn.Module):
    method __init__ (line 45) | def __init__(self, args, vocab_size, num_classes, use_checkpoint=True):
    method forward (line 56) | def forward(self, input_ids, position_ids):
  class GPTFsdpStageBase (line 62) | class GPTFsdpStageBase(torch.nn.Module):
    method __init__ (line 63) | def __init__(self, args, num_stage_layers, vocab_size, num_classes, us...
    method _create_first_layer (line 76) | def _create_first_layer(self):
    method _create_last_layer (line 84) | def _create_last_layer(self):
    method _create_fsdp_transformer_layer (line 92) | def _create_fsdp_transformer_layer(self):
  class GPTFsdpStageFirst (line 97) | class GPTFsdpStageFirst(GPTFsdpStageBase):
    method __init__ (line 98) | def __init__(self, args, num_stage_layers, vocab_size, num_classes, de...
    method forward (line 107) | def forward(self, x):
  class GPTFsdpStageMiddle (line 112) | class GPTFsdpStageMiddle(GPTFsdpStageBase):
    method __init__ (line 113) | def __init__(self, args, num_stage_layers, vocab_size, num_classes, de...
    method forward (line 122) | def forward(self, x):
  class GPTFsdpStageLast (line 127) | class GPTFsdpStageLast(GPTFsdpStageBase):
    method __init__ (line 128) | def __init__(self, args, num_stage_layers, vocab_size, num_classes, de...
    method forward (line 138) | def forward(self, x):

FILE: training/modules/dist_gpt_pp_module.py
  class GPTStageBase (line 8) | class GPTStageBase(nn.Module):
    method __init__ (line 9) | def __init__(self, args, config):
    method _create_first_layer (line 45) | def _create_first_layer(self):
    method _create_last_layer (line 60) | def _create_last_layer(self):
    method _create_transformer_layer (line 75) | def _create_transformer_layer(self, layer_idx=0):
  class GPTStageFull (line 92) | class GPTStageFull(GPTStageBase):
    method __init__ (line 93) | def __init__(self, args, config, device):
    method forward (line 105) | def forward(self, x, **kargs):
  class GPTStageFirst (line 111) | class GPTStageFirst(GPTStageBase):
    method __init__ (line 112) | def __init__(self, args, config, device):
    method forward (line 120) | def forward(self, x, **kargs):
  class GPTStageMiddle (line 128) | class GPTStageMiddle(GPTStageBase):
    method __init__ (line 129) | def __init__(self, args, config, device):
    method forward (line 137) | def forward(self, x, **kargs):
  class GPTStageLast (line 145) | class GPTStageLast(GPTStageBase):
    method __init__ (line 146) | def __init__(self, args, config, device):
    method forward (line 162) | def forward(self, x, **kargs):

FILE: training/modules/hf_gpt2_modules.py
  function gpt_loss_func (line 22) | def gpt_loss_func(input, target):
  class GPTEmbeddings (line 30) | class GPTEmbeddings(nn.Module):
    method __init__ (line 31) | def __init__(self, config):
    method forward (line 40) | def forward(self, input_ids, **kargs):
  class GPTAttention (line 61) | class GPTAttention(_GPT2Attention):
    method _attn (line 63) | def _attn(self, query, key, value, attention_mask=None, head_mask=None...
    method forward (line 110) | def forward(
  class GPTBlock (line 165) | class GPTBlock(_GPT2Block):
    method __init__ (line 166) | def __init__(self, config, layer_idx=None, use_checkpoint=True):
    method forward (line 194) | def forward(self, x: torch.Tensor, prefix_masks=None, **kargs) -> torc...
  class GPTModel (line 214) | class GPTModel(_GPT2Model):
    method __init__ (line 215) | def __init__(self, config):
    method forward (line 236) | def forward(self, input_ids, attention_mask=None, **kargs):
  class GPTLMHead (line 270) | class GPTLMHead(nn.Module):
    method __init__ (line 271) | def __init__(self, config):
    method forward (line 276) | def forward(self, x, **kargs):
  class GPTLMHeadModel (line 281) | class GPTLMHeadModel(_GPT2LMHeadModel):
    method __init__ (line 283) | def __init__(self, config):
  class GPTClassificationHead (line 296) | class GPTClassificationHead(nn.Module):
    method __init__ (line 297) | def __init__(self, config):
    method forward (line 303) | def forward(self, hidden_states, input_ids=None):
  class GPTForClassification (line 317) | class GPTForClassification(_GPT2ForSequenceClassification):
    method __init__ (line 319) | def __init__(self, config):

FILE: training/modules/hf_gptj_modules.py
  function gpt_loss_func (line 23) | def gpt_loss_func(input, target):
  function fixed_pos_embedding (line 31) | def fixed_pos_embedding(x, seq_dim=1, seq_len=None):
  class GPTJMLP (line 40) | class GPTJMLP(_GPTJMLP):
    method __init__ (line 41) | def __init__(self, intermediate_size, config, device='cpu'):  # in MLP...
  class GPTJAttention (line 52) | class GPTJAttention(_GPTJAttention):
    method __init__ (line 54) | def __init__(self, config, device='cpu'):
    method _attn (line 87) | def _attn(
    method forward (line 138) | def forward(
  class GPTEmbeddings (line 214) | class GPTEmbeddings(nn.Module):
    method __init__ (line 215) | def __init__(self, config, device='cpu'):
    method from_pretrained (line 223) | def from_pretrained(cls, model_path, config=None):
    method forward (line 236) | def forward(self, input_ids, *args, **kargs):
  class GPTBlock (line 245) | class GPTBlock(_GPTJBlock):
    method __init__ (line 246) | def __init__(self, config, *args, use_checkpoint=True, device='cpu', *...
    method from_pretrained (line 265) | def from_pretrained(cls, model_path, config=None, layer_index=None):
    method forward (line 280) | def forward(self, x: torch.Tensor, prefix_masks=None, layer_past=None,...
  class GPTLMHead (line 318) | class GPTLMHead(nn.Module):
    method __init__ (line 319) | def __init__(self, config, device='cpu'):
    method from_pretrained (line 325) | def from_pretrained(cls, model_path, config=None):
    method forward (line 338) | def forward(self, x, **kargs):

FILE: training/modules/hf_gptneox_modules.py
  class FlashAttentionV2 (line 31) | class FlashAttentionV2(nn.Module):
    method __init__ (line 41) | def __init__(self, softmax_scale=None, attention_dropout=0.0):
    method forward (line 46) | def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens...
  function rotate_half (line 73) | def rotate_half(x):
  function apply_rotary_pos_emb (line 80) | def apply_rotary_pos_emb(q, k, cos, sin, offset=0):
  class GPTNeoXAttention (line 98) | class GPTNeoXAttention(_GPTNeoXAttention):
    method __init__ (line 100) | def __init__(self, config):
    method forward (line 130) | def forward(
    method _attn (line 228) | def _attn(self, query, key, value, attention_mask=None, head_mask=None):
  class GPTEmbeddings (line 281) | class GPTEmbeddings(nn.Module):
    method __init__ (line 283) | def __init__(self, config):
    method from_pretrained (line 291) | def from_pretrained(cls, model_path, config=None):
    method forward (line 307) | def forward(self, input_ids, *args, **kargs):
  class GPTBlock (line 316) | class GPTBlock(_GPTNeoXBlock):
    method __init__ (line 318) | def __init__(self, config, *args, use_checkpoint=True, **kargs):
    method from_pretrained (line 355) | def from_pretrained(cls, model_path, config=None, layer_index=None):
    method forward (line 373) | def forward(self,
  class GPTLMHead (line 423) | class GPTLMHead(nn.Module):
    method __init__ (line 425) | def __init__(self, config):
    method from_pretrained (line 434) | def from_pretrained(cls, model_path, config=None):
    method forward (line 450) | def forward(self, x, *args, **kargs):

FILE: training/modules/hf_opt_modules.py
  function _make_causal_mask (line 15) | def _make_causal_mask(
  function _expand_mask (line 38) | def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Option...
  function _prepare_decoder_attention_mask (line 51) | def _prepare_decoder_attention_mask(attention_mask, input_shape, inputs_...
  class GPTEmbeddings (line 72) | class GPTEmbeddings(nn.Module):
    method __init__ (line 73) | def __init__(self, config, device='cpu'):
    method from_pretrained (line 86) | def from_pretrained(cls, model_path, config=None):
    method forward (line 99) | def forward(self, input_ids, past_layer=None, mask=None, **kargs):
  class OPTAttention (line 143) | class OPTAttention(_OPTAttention):
    method __init__ (line 144) | def __init__(
    method _shape (line 172) | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    method forward (line 175) | def forward(
  class GPTBlock (line 295) | class GPTBlock(OPTDecoderLayer):
    method __init__ (line 296) | def __init__(self, config, *args, use_checkpoint=True, device='cpu', *...
    method from_pretrained (line 362) | def from_pretrained(cls, model_path, config=None, layer_index=None):
    method forward (line 377) | def forward(self, x: torch.Tensor, layer_past=None, mask=None, *args, ...
  class GPTLMHead (line 446) | class GPTLMHead(nn.Module):
    method __init__ (line 447) | def __init__(self, config, device='cpu'):
    method from_pretrained (line 463) | def from_pretrained(cls, model_path, config=None):
    method forward (line 476) | def forward(self, x, input_ids=None, *args, **kargs):

FILE: training/modules/llama_modules.py
  class RotaryEmbedding (line 43) | class RotaryEmbedding(torch.nn.Module):
    method __init__ (line 61) | def __init__(
    method _compute_inv_freq (line 109) | def _compute_inv_freq(self, device=None):
    method _update_cos_sin_cache (line 118) | def _update_cos_sin_cache(self, seqlen, device=None, dtype=None):
    method forward (line 169) | def forward(
  class FlashAttentionV2 (line 237) | class FlashAttentionV2(nn.Module):
    method __init__ (line 248) | def __init__(self, softmax_scale=None, attention_dropout=0.0):
    method forward (line 253) | def forward(
  function _make_causal_mask (line 300) | def _make_causal_mask(
  function _make_causal_mask_device (line 321) | def _make_causal_mask_device(
  function _expand_mask (line 351) | def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Option...
  function _prepare_decoder_attention_mask (line 367) | def _prepare_decoder_attention_mask(
  function rmsnorm_func (line 396) | def rmsnorm_func(hidden_states, weight, variance_epsilon):
  class RMSNorm (line 404) | class RMSNorm(nn.Module):
    method __init__ (line 405) | def __init__(self, hidden_size, eps=1e-6):
    method forward (line 417) | def forward(self, hidden_states):
  class LlamaMLP (line 421) | class LlamaMLP(nn.Module):
    method __init__ (line 422) | def __init__(
    method forward (line 434) | def forward(self, x):
  class LlamaAttention (line 438) | class LlamaAttention(nn.Module):
    method __init__ (line 441) | def __init__(
    method _shape (line 510) | def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
    method forward (line 517) | def forward(
  class LlamaDecoderLayer (line 559) | class LlamaDecoderLayer(nn.Module):
    method __init__ (line 560) | def __init__(self, config: LlamaConfig):
    method forward (line 578) | def forward(
  class GPTEmbeddings (line 632) | class GPTEmbeddings(nn.Module):
    method __init__ (line 633) | def __init__(self, config, device="cpu"):
    method forward (line 643) | def forward(
  class GPTLMHead (line 654) | class GPTLMHead(nn.Module):
    method __init__ (line 655) | def __init__(self, config, device="cpu"):
    method forward (line 663) | def forward(
  class GPTBlock (line 676) | class GPTBlock(nn.Module):
    method __init__ (line 677) | def __init__(self, config: LlamaConfig, *args, **kargs):
    method forward (line 724) | def forward(

FILE: training/modules/task_modules.py
  class GlueClassification (line 4) | class GlueClassification(torch.nn.Module):
    method __init__ (line 5) | def __init__(self, model_dim, num_classes):
    method forward (line 12) | def forward(self, hidden_states, pooler_index=0):

FILE: training/modules/tokenizer.py
  function build_tokenizer (line 4) | def build_tokenizer(args):
  function build_gpt2_tokenizer (line 10) | def build_gpt2_tokenizer(args):
  function build_deberta_tokenizer (line 15) | def build_deberta_tokenizer(args):

FILE: training/modules/utils.py
  function gpt_loss_func (line 10) | def gpt_loss_func(input, target):

FILE: training/optimizer/grad_scalar.py
  class GradScaler (line 7) | class GradScaler(ABC):
    method __init__ (line 8) | def __init__(self, initial_scale, device=None):
    method scale (line 15) | def scale(self):
    method inv_scale (line 19) | def inv_scale(self):
    method update (line 23) | def update(self, found_inf):
    method state_dict (line 27) | def state_dict(self):
    method load_state_dict (line 31) | def load_state_dict(self, state_dict):
  class ConstantGradScaler (line 35) | class ConstantGradScaler(GradScaler):
    method update (line 37) | def update(self, found_inf):
    method state_dict (line 40) | def state_dict(self):
    method load_state_dict (line 43) | def load_state_dict(self, state_dict):
  class DynamicGradScaler (line 47) | class DynamicGradScaler(GradScaler):
    method __init__ (line 49) | def __init__(self, initial_scale, min_scale,
    method update (line 79) | def update(self, found_inf):
    method state_dict (line 102) | def state_dict(self):
    method load_state_dict (line 109) | def load_state_dict(self, state_dict):

FILE: training/optimizer/optimizer.py
  function _has_overflow_serial (line 7) | def _has_overflow_serial(grads):
  function _zero_grad_group (line 40) | def _zero_grad_group(group, set_to_none):
  class Fp16Optimizer (line 62) | class Fp16Optimizer:
    method __init__ (line 64) | def __init__(self, optimizer, grad_scaler, device, offload=False):
    method zero_grad (line 113) | def zero_grad(self, set_to_none=True):
    method get_loss_scale (line 120) | def get_loss_scale(self):
    method _copy_model_grads_to_optimizer_grads (line 123) | def _copy_model_grads_to_optimizer_grads(self):
    method _unscale_optimizer_grads_and_check_for_nan (line 138) | def _unscale_optimizer_grads_and_check_for_nan(self):
    method _get_model_and_optimizer_params_data_float16_deprecated (line 157) | def _get_model_and_optimizer_params_data_float16_deprecated(self):
    method _copy_optimizer_params_to_model_params (line 166) | def _copy_optimizer_params_to_model_params(self):
    method _copy_model_params_to_optimizer_params (line 179) | def _copy_model_params_to_optimizer_params(self):
    method reload_model_params (line 191) | def reload_model_params(self):
    method step (line 195) | def step(self):
    method scale (line 216) | def scale(self, z):
    method unscale (line 219) | def unscale(self, z):
    method state_dict (line 222) | def state_dict(self):
    method load_state_dict (line 225) | def load_state_dict(self, state_dict):
  function get_fp16_optimizer (line 229) | def get_fp16_optimizer(args, optimizer, device):

FILE: training/pipeline_parallel/dist_gpipe_pipeline_async.py
  function get_parameter_names (line 16) | def get_parameter_names(model, forbidden_layer_types):
  function create_optimizer (line 32) | def create_optimizer(model, optimizer_type, weight_decay=0.01, learning_...
  class GpipeAsync (line 67) | class GpipeAsync:
    method __init__ (line 78) | def __init__(self, args, config, device, use_dp=False,
    method _compute_micro_batch_size (line 225) | def _compute_micro_batch_size(self):
    method zero_input_grad (line 236) | def zero_input_grad(self):
    method profile_mark_forward_comp_start (line 242) | def profile_mark_forward_comp_start(self, i):
    method profile_mark_forward_recv_start (line 247) | def profile_mark_forward_recv_start(self, i):
    method profile_mark_forward_send_start (line 252) | def profile_mark_forward_send_start(self, i):
    method profile_mark_forward_send_end (line 257) | def profile_mark_forward_send_end(self, i):
    method profile_mark_backward_comp_start (line 262) | def profile_mark_backward_comp_start(self, i):
    method profile_mark_backward_recv_start (line 267) | def profile_mark_backward_recv_start(self, i):
    method profile_mark_backward_send_start (line 272) | def profile_mark_backward_send_start(self, i):
    method profile_mark_backward_send_end (line 277) | def profile_mark_backward_send_end(self, i):
    method get_ts (line 282) | def get_ts(self, event):
    method forward_stage (line 285) | def forward_stage(self, input_data=None, aux_input_data=None):
    method profiling_forward_stage (line 388) | def profiling_forward_stage(self):
    method backward_stage (line 417) | def backward_stage(self, cached_output_micro_batches: List[torch.Tenso...
    method profiling_backward_stage (line 521) | def profiling_backward_stage(self):
    method save_on_disk (line 549) | def save_on_disk(self, path):
    method optimizer_step (line 553) | def optimizer_step(self):
    method profiling_optimizer_step (line 574) | def profiling_optimizer_step(self):
    method export_profiling_result (line 587) | def export_profiling_result(self, filename):
    method sgd_iter (line 591) | def sgd_iter(self, input_=None, target=None,
    method infer_stage (line 656) | def infer_stage(self, input_data=None, aux_input_data=None,
    method infer_iter (line 736) | def infer_iter(self, input_=None, target=None,

FILE: training/pipeline_parallel/dist_pp_utils.py
  function get_pp_module (line 4) | def get_pp_module(args, config, device, use_dp):

FILE: training/tasks/data_loaders/data_utils.py
  function random_chunk (line 32) | def random_chunk(li, min_chunk=1, max_chunk=5):
  class UL2RProcessor (line 42) | class UL2RProcessor:
    method __init__ (line 48) | def __init__(self, tokenizer, seq_length=1024):
    method preprocess_tokens_s2s (line 59) | def preprocess_tokens_s2s(self, tokens):
    method preprocess_tokens_nlg (line 76) | def preprocess_tokens_nlg(self, tokens):
    method preprocess_tokens_nlu (line 98) | def preprocess_tokens_nlu(self, tokens):
    method preprocess_ul2r (line 136) | def preprocess_ul2r(self, inputs):
    method preprocess_random (line 146) | def preprocess_random(self, inputs):
    method __call__ (line 168) | def __call__(self, inputs):
  class StreamDataset (line 175) | class StreamDataset(IterableDataset):
    method __init__ (line 177) | def __init__(self, data, tokenizer, seq_length=1024, doc_separator=Non...
    method state_dict (line 187) | def state_dict(self):
    method load_state_dict (line 190) | def load_state_dict(self, state_dict):
    method get_sequence (line 193) | def get_sequence(self):
    method get_stream (line 208) | def get_stream(self):
    method __iter__ (line 214) | def __iter__(self):
  class StreamDatasetList (line 220) | class StreamDatasetList(IterableDataset):
    method __init__ (line 221) | def __init__(self, task_names, datasets, sample_probs, tokenizer, seq_...
    method state_dict (line 234) | def state_dict(self):
    method load_state_dict (line 237) | def load_state_dict(self, state_dict):
    method get_sequence (line 240) | def get_sequence(self):
    method get_stream (line 268) | def get_stream(self):
    method __iter__ (line 271) | def __iter__(self):
    method tokenize_function (line 276) | def tokenize_function(self, examples):
    method get_dataset_token_count (line 285) | def get_dataset_token_count(self) -> int:
    method get_dataset_example_count (line 314) | def get_dataset_example_count(self) -> int:
  function name_to_dataset (line 329) | def name_to_dataset(task, tokenizer, args):
  function name_to_dataset_eval (line 343) | def name_to_dataset_eval(task, tokenizer, args):
  function get_train_data_loader (line 352) | def get_train_data_loader(args, tokenizer, num_workers=1, state_dict=None):
  function get_eval_data_loader (line 407) | def get_eval_data_loader(args, tokenizer, num_workers=1, state_dict=None):
  function get_ul2r_train_data_loader (line 434) | def get_ul2r_train_data_loader(args, tokenizer, num_workers=1, state_dic...

FILE: training/tasks/data_loaders/prosocial.py
  class StreamDataset (line 14) | class StreamDataset(IterableDataset):
    method __init__ (line 15) | def __init__(self, dataset, tokenizer, seq_length=1024):
    method state_dict (line 25) | def state_dict(self):
    method load_state_dict (line 30) | def load_state_dict(self, state_dict):
    method get_sequence (line 34) | def get_sequence(self):
    method get_stream (line 66) | def get_stream(self):
    method __iter__ (line 69) | def __iter__(self):

FILE: training/utils/dist_args_utils.py
  function add_device_arguments (line 1) | def add_device_arguments(parser):
  function add_torch_distributed_arguments (line 12) | def add_torch_distributed_arguments(parser):
  function add_task_arguments (line 29) | def add_task_arguments(parser):
  function add_model_arguments (line 46) | def add_model_arguments(parser):
  function add_training_hyper_parameter_arguments (line 57) | def add_training_hyper_parameter_arguments(parser):
  function add_mixed_precision_arguments (line 72) | def add_mixed_precision_arguments(parser):
  function add_parallel_schema_arguments (line 90) | def add_parallel_schema_arguments(parser):
  function get_model_arguments_str (line 99) | def get_model_arguments_str(args):
  function get_dist_arguments_str (line 103) | def get_dist_arguments_str(args, add_rank=True):
  function get_learning_arguments_str (line 111) | def get_learning_arguments_str(args):
  function get_mixed_precision_arguments_str (line 115) | def get_mixed_precision_arguments_str(args):

FILE: training/utils/dist_checkpoint_utils.py
  function load_checkpoint (line 11) | def load_checkpoint(pipe, args):
  function save_checkpoint (line 64) | def save_checkpoint(pipe, args) -> str:
  function save_stream_dataloader_state_dict (line 107) | def save_stream_dataloader_state_dict(dataloader, pipe, args):
  function load_stream_dataloader_state_dict (line 121) | def load_stream_dataloader_state_dict(dataloader, pipe, args):

FILE: training/utils/dist_debug_utils.py
  function print_cuda_memory (line 4) | def print_cuda_memory(args, info: str, device=None):
  function print_multi_cuda_memory (line 12) | def print_multi_cuda_memory(args, info: str):

FILE: training/utils/event_report.py
  class EventReporter (line 28) | class EventReporter:
    method __init__ (line 75) | def __init__(self, host=None, auth_token=None, job_id=None):
    method is_enabled (line 80) | def is_enabled(self) -> bool:
    method report (line 114) | def report(self, object, message, event_type,
  function add_entry_reporter_arguments (line 188) | def add_entry_reporter_arguments(parser):
  function main (line 195) | def main():

FILE: training/utils/logging_utils.py
  function init_train_logger (line 19) | def init_train_logger(args):
  function train_log (line 46) | def train_log(x, *args, **kargs):

FILE: training/utils/upload_manager.py
  class UploadManager (line 11) | class UploadManager:
    method __init__ (line 12) | def __init__(self, aws_endpoint_url: str, aws_access_key_id: str,
    method add_task (line 43) | def add_task(self, directory: str, checkpoint_upload_prefix: str, step...
    method wait (line 58) | def wait(self):
    method _report_event (line 62) | def _report_event(self, **kwargs):
    method _wait_for_file_write_to_finish (line 66) | def _wait_for_file_write_to_finish(self, file_path: str, wait_start_ti...
    method _execute_task (line 81) | def _execute_task(self, directory, s3_bucket, s3_key_prefix, step: int):
  function add_aws_arguments (line 184) | def add_aws_arguments(parser: argparse.ArgumentParser):
  function aws_process_args (line 191) | def aws_process_args(args: argparse.Namespace, required: bool = False):
  function main (line 207) | def main():

Condensed preview: 83 files, each showing path, character count, and a content snippet (full structured content: 469K chars).

[
  {
    "path": ".github/ISSUE_TEMPLATE/bug_report.md",
    "chars": 834,
    "preview": "---\nname: Bug report\nabout: Create a report to help us improve\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**Describe the b"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/feature_request.md",
    "chars": 595,
    "preview": "---\nname: Feature request\nabout: Suggest an idea for this project\ntitle: ''\nlabels: ''\nassignees: ''\n\n---\n\n**Is your fea"
  },
  {
    "path": ".github/ISSUE_TEMPLATE/openchatkit-feedback-report.yaml",
    "chars": 978,
    "preview": "name: OpenChatKit Feedback Report\ndescription: Details of feedback from using OpenChatKit test app\ntitle: OpenChatKit Fe"
  },
  {
    "path": ".gitignore",
    "chars": 2263,
    "preview": "# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packagi"
  },
  {
    "path": "LICENSE",
    "chars": 12486,
    "preview": "                                 Apache License\n                           Version 2.0, January 2004\n                   "
  },
  {
    "path": "README.md",
    "chars": 14407,
    "preview": "# OpenChatKit\n\nOpenChatKit provides a powerful, open-source base to create both specialized and general purpose models f"
  },
  {
    "path": "data/OIG/prepare.py",
    "chars": 351,
    "preview": "import sys\nimport os\n\n# Import the prepare_data function\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))\nsys.pa"
  },
  {
    "path": "data/OIG-chip2/prepare.sh",
    "chars": 180,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\nwget https://huggingface.co/datasets/laion/OIG/re"
  },
  {
    "path": "data/OIG-moderation/prepare.py",
    "chars": 365,
    "preview": "import sys\nimport os\n\n# Import the prepare_data function\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))\nsys.pa"
  },
  {
    "path": "data/prepare_data.py",
    "chars": 10740,
    "preview": "import argparse\nfrom shutil import copyfile\nimport boto3\nimport botocore\nimport glob\nimport gzip\nimport os\nimport re\nimp"
  },
  {
    "path": "data/wikipedia-3sentence-level-retrieval-index/prepare.py",
    "chars": 402,
    "preview": "import sys\nimport os\n\n# Import the prepare_data function\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))\nsys.pa"
  },
  {
    "path": "docs/GPT-NeoXT-Chat-Base-20B.md",
    "chars": 9817,
    "preview": "# GPT-NeoXT-Chat-Base-20B\n\nOpenChatKit includes an instruction-tuned 20 billion parameter language model called GPT-NeoX"
  },
  {
    "path": "docs/finetuning-RedPajama-3B.md",
    "chars": 3827,
    "preview": "# RedPajama-3B\n\nIn this tutorial, you will learn how to fine-tune a base LLM on a sample of data. By the end of \nthe tut"
  },
  {
    "path": "environment.yml",
    "chars": 612,
    "preview": "name: OpenChatKit\nchannels:\n  - pytorch\n  - nvidia\n  - conda-forge\n  - defaults\ndependencies:\n  - cudatoolkit=11.8.0\n  -"
  },
  {
    "path": "inference/README.md",
    "chars": 5235,
    "preview": "# OpenChatKit Inference\nThis directory contains code for OpenChatKit's inference.\n\n## Arguments\n- `--gpu-id`: Primary GP"
  },
  {
    "path": "inference/bot.py",
    "chars": 9109,
    "preview": "import os\nimport sys\n\nINFERENCE_DIR = os.path.dirname(os.path.abspath(__file__))\n\n# TODO: PYTHONPATH hacks are never a g"
  },
  {
    "path": "inference/conversation.py",
    "chars": 1603,
    "preview": "import re\nimport time\n\nMEANINGLESS_WORDS = ['<pad>', '</s>', '<|endoftext|>']\nPRE_PROMPT = \"\"\"\\\nCurrent Date: {}\nCurrent"
  },
  {
    "path": "pretrained/GPT-NeoX-20B/prepare.py",
    "chars": 402,
    "preview": "import sys\nimport os\n\n# Import the prepare_data function\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))\nsys.pa"
  },
  {
    "path": "pretrained/Llama-2-7B-32K-beta/prepare.py",
    "chars": 2548,
    "preview": "import os\nimport argparse\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n\nDIR = o"
  },
  {
    "path": "pretrained/Pythia-6.9B-deduped/prepare.py",
    "chars": 409,
    "preview": "import sys\nimport os\n\n# Import the prepare_data function\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))\nsys.pa"
  },
  {
    "path": "pretrained/RedPajama-3B/prepare.py",
    "chars": 423,
    "preview": "import os\nimport sys\n\n# Import the prepare_data function\ncurrent_dir = os.path.dirname(os.path.abspath(__file__))\nsys.pa"
  },
  {
    "path": "pretrained/RedPajama-7B/prepare.py",
    "chars": 2663,
    "preview": "import os\nimport argparse\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n\nDIR = o"
  },
  {
    "path": "pretrained/prepare_pretrained.py",
    "chars": 3064,
    "preview": "import os\nimport argparse\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig\n\nDIR = o"
  },
  {
    "path": "retrieval/README.md",
    "chars": 2268,
    "preview": "# Retrieval-Enhanced Chatbot\n\nThis is a demonstration of how to enhance a chatbot using Wikipedia. We'll be using [Chris"
  },
  {
    "path": "retrieval/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "retrieval/wikipedia.py",
    "chars": 3531,
    "preview": "# This file was adapted from ChristophSchuhmann/wikipedia-3sentence-level-retrieval-index:\n#   https://huggingface.co/da"
  },
  {
    "path": "tools/README.md",
    "chars": 2451,
    "preview": "# OpenChatKit Tools\n\n## convert_to_hf_gptneox.py\n\n## ml_load_benchmark.py\n\nThe commands to run the model load benchmark "
  },
  {
    "path": "tools/benchmark_input.json",
    "chars": 319,
    "preview": "{\n    \"GPT-NeoXT-Chat-Base-20B\": \"togethercomputer/GPT-NeoXT-Chat-Base-20B\",\n    \"Pythia-Chat-Base-7B\": \"togethercompute"
  },
  {
    "path": "tools/convert_to_hf_gptneox.py",
    "chars": 5161,
    "preview": "import torch\nimport torch.nn as nn\n\nimport argparse\n\nfrom transformers import GPTNeoXForCausalLM\n\nfrom transformers impo"
  },
  {
    "path": "tools/convert_to_hf_llama.py",
    "chars": 6275,
    "preview": "import os\nimport argparse\nimport torch\n\nimport torch\nimport torch.nn as nn\n\nfrom transformers import LlamaForCausalLM\nfr"
  },
  {
    "path": "tools/model_load_benchmark.py",
    "chars": 7289,
    "preview": "import argparse\nimport json\nimport time\nimport torch\nimport torchvision\nimport os\nimport re\nimport psutil\nfrom transform"
  },
  {
    "path": "training/README.md",
    "chars": 4273,
    "preview": "# OpenChatKit Training\n\nThis directory contains code for training a chat model using OpenChatKit. The main training scri"
  },
  {
    "path": "training/comm/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/comm/comm_utils.py",
    "chars": 8737,
    "preview": "from .torch_backend import *\nfrom .nccl_backend import *\n\n_DATA_PARALLEL_COMM = None\n_DATA_PARALLEL_RANK = None\n_DATA_PA"
  },
  {
    "path": "training/comm/nccl_backend.py",
    "chars": 7248,
    "preview": "import torch\nimport numpy as np\nimport cupy\nimport cupy.cuda.nccl\nimport torch.distributed as dist\nfrom typing import Li"
  },
  {
    "path": "training/comm/torch_backend.py",
    "chars": 4040,
    "preview": "import torch\nimport torch.distributed as dist\nfrom typing import List\n\nclass TorchCommunicator:\n        \n    def __init_"
  },
  {
    "path": "training/data_parallel/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/data_parallel/dist_dp_allreduce.py",
    "chars": 7185,
    "preview": "import torch.cuda\nfrom comm.comm_utils import *\nfrom .flatten_utils import flatten_params\n\n\nclass AllReduceDP:\n    def _"
  },
  {
    "path": "training/data_parallel/dist_dp_central_ps.py",
    "chars": 10343,
    "preview": "import torch.cuda\nfrom comm.comm_utils import *\nfrom .flatten_utils import flatten_params\n\n\nclass CentralPSDP:\n    def _"
  },
  {
    "path": "training/data_parallel/dist_dp_local.py",
    "chars": 8116,
    "preview": "import torch.cuda\nimport cupy\nfrom comm.comm_utils import *\nfrom .flatten_utils import flatten_params\n\n\nclass LocalDP:\n "
  },
  {
    "path": "training/data_parallel/dist_dp_sharded_ps.py",
    "chars": 5832,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.cuda\nfrom comm.comm_utils import *\nfrom "
  },
  {
    "path": "training/data_parallel/dist_dp_utils.py",
    "chars": 711,
    "preview": "from .dist_dp_allreduce import AllReduceDP\nfrom .dist_dp_sharded_ps import ShardedPSDP\nfrom .dist_dp_local import LocalD"
  },
  {
    "path": "training/data_parallel/flatten_utils.py",
    "chars": 2914,
    "preview": "import torch\n\n\ndef _assert_contiguous(tensors):\n    data_ptr = None\n    for t in tensors:\n        if data_ptr is not Non"
  },
  {
    "path": "training/dist_clm_train.py",
    "chars": 19959,
    "preview": "import argparse\nimport time\nimport random\nimport numpy as np\nimport torch\nimport torch.autograd.profiler as profiler\nfro"
  },
  {
    "path": "training/dist_prefixlm_train.py",
    "chars": 13688,
    "preview": "import argparse\nimport time\nimport random\nimport numpy as np\nimport torch\nimport torch.autograd.profiler as profiler\nfro"
  },
  {
    "path": "training/finetune_GPT-NeoXT-Chat-Base-20B.sh",
    "chars": 3467,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\n\nnetif=lo\nexport GLOO_SOCKET_IFNAME=${netif}\nexpo"
  },
  {
    "path": "training/finetune_Pythia-Chat-Base-7B.sh",
    "chars": 3476,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\n\nnetif=lo\nexport GLOO_SOCKET_IFNAME=${netif}\nexpo"
  },
  {
    "path": "training/finetune_RedPajama-INCITE-7B-Chat.sh",
    "chars": 1930,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\n\nnetif=lo\nexport GLOO_SOCKET_IFNAME=${netif}\nexpo"
  },
  {
    "path": "training/finetune_RedPajama-INCITE-Chat-3B-v1.sh",
    "chars": 1933,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\n\nnetif=lo\nexport GLOO_SOCKET_IFNAME=${netif}\nexpo"
  },
  {
    "path": "training/finetune_llama-2-7b-32k-booksum.sh",
    "chars": 1987,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\n\nnetif=lo\nexport GLOO_SOCKET_IFNAME=${netif}\nexpo"
  },
  {
    "path": "training/finetune_llama-2-7b-32k-mqa.sh",
    "chars": 2005,
    "preview": "DIR=$(cd -- \"$( dirname -- \"${BASH_SOURCE[0]}\" )\" &> /dev/null && pwd)\n\nnetif=lo\nexport GLOO_SOCKET_IFNAME=${netif}\nexpo"
  },
  {
    "path": "training/lora/example/redpajama-incite-chat-3b.py",
    "chars": 2582,
    "preview": "import os\nimport json\nos.environ[\"CUDA_VISIBLE_DEVICES\"]=\"0\"\nimport torch\nimport transformers\nimport torch.nn as nn\nimpo"
  },
  {
    "path": "training/lora/example/redpajama-incite-chat-3b_inference.py",
    "chars": 761,
    "preview": "import torch\nfrom peft import PeftModel, PeftConfig\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\npeft_m"
  },
  {
    "path": "training/modules/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/modules/deberta_modules.py",
    "chars": 14984,
    "preview": "import torch\nimport numpy as np\nimport math\nfrom torch import nn\nfrom torch.nn import functional\nfrom torch.utils.checkp"
  },
  {
    "path": "training/modules/dist_deberta_pp_module.py",
    "chars": 2700,
    "preview": "from torch import nn\nfrom .deberta_modules import DebertaV2Embeddings, DebertaV2Layers, DebertaClassificationHead\n\n\nclas"
  },
  {
    "path": "training/modules/dist_gpt_fsdp_module.py",
    "chars": 6498,
    "preview": "import torch\nfrom fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP\nfrom .task_modules import GlueClass"
  },
  {
    "path": "training/modules/dist_gpt_pp_module.py",
    "chars": 7017,
    "preview": "import numpy as np\nfrom torch import nn\nfrom comm.comm_utils import *\n\nfrom copy import deepcopy\n\n\nclass GPTStageBase(nn"
  },
  {
    "path": "training/modules/hf_gpt2_modules.py",
    "chars": 13268,
    "preview": "import torch\nimport math\nimport numpy as np\nfrom torch import nn\nfrom torch.nn import functional\nfrom torch.utils.checkp"
  },
  {
    "path": "training/modules/hf_gptj_modules.py",
    "chars": 13365,
    "preview": "import os\nimport torch\nimport math\nimport numpy as np\nfrom torch import nn\nfrom torch.nn import functional\nfrom torch.ut"
  },
  {
    "path": "training/modules/hf_gptneox_modules.py",
    "chars": 17358,
    "preview": "import os\nimport torch\nimport numpy as np\nfrom torch import nn\nfrom torch.nn import functional\nfrom torch.utils.checkpoi"
  },
  {
    "path": "training/modules/hf_opt_modules.py",
    "chars": 20210,
    "preview": "from typing import List, Optional, Tuple, Union\n\nimport os\nimport torch\nfrom torch import nn\nfrom torch.utils.checkpoint"
  },
  {
    "path": "training/modules/llama_modules.py",
    "chars": 26669,
    "preview": "# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.\n#\n# This code is based on EleutherAI's G"
  },
  {
    "path": "training/modules/task_modules.py",
    "chars": 583,
    "preview": "import torch\n\n\nclass GlueClassification(torch.nn.Module):\n    def __init__(self, model_dim, num_classes):\n        super("
  },
  {
    "path": "training/modules/tokenizer.py",
    "chars": 585,
    "preview": "\nfrom transformers import AutoTokenizer, GPT2TokenizerFast, DebertaV2Tokenizer\n\ndef build_tokenizer(args):\n    tokenizer"
  },
  {
    "path": "training/modules/utils.py",
    "chars": 456,
    "preview": "import torch\nimport math\nimport numpy as np\nfrom torch import nn\nfrom torch.nn import functional\nfrom typing import Opti"
  },
  {
    "path": "training/optimizer/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/optimizer/grad_scalar.py",
    "chars": 3891,
    "preview": "from abc import ABC\nfrom abc import abstractmethod\n\nimport torch\n\n\nclass GradScaler(ABC):\n    def __init__(self, initial"
  },
  {
    "path": "training/optimizer/optimizer.py",
    "chars": 10401,
    "preview": "import torch\nfrom .grad_scalar import *\n\n# This follows some implementation from Megatron\n\n\ndef _has_overflow_serial(gra"
  },
  {
    "path": "training/pipeline_parallel/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/pipeline_parallel/dist_gpipe_pipeline_async.py",
    "chars": 38291,
    "preview": "import time\nimport json\nimport torch.nn.functional\nfrom torch import optim\nfrom comm.comm_utils import *\nfrom modules.di"
  },
  {
    "path": "training/pipeline_parallel/dist_pp_utils.py",
    "chars": 294,
    "preview": "from .dist_gpipe_pipeline_async import GpipeAsync\n\n\ndef get_pp_module(args, config, device, use_dp):\n    \n    if args.pp"
  },
  {
    "path": "training/tasks/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/tasks/data_loaders/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/tasks/data_loaders/data_utils.py",
    "chars": 15482,
    "preview": "import os\nimport re\nimport torch\nimport json\nimport numpy as np\nfrom torch.utils.data import IterableDataset, DataLoader"
  },
  {
    "path": "training/tasks/data_loaders/prosocial.py",
    "chars": 1848,
    "preview": "import os\nimport re\nimport torch\nimport json\nfrom torch.utils.data import IterableDataset, DataLoader\nfrom itertools imp"
  },
  {
    "path": "training/utils/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "training/utils/dist_args_utils.py",
    "chars": 6324,
    "preview": "def add_device_arguments(parser):\n    parser.add_argument('--use-cuda', default=True, type=lambda x: (str(x).lower() == "
  },
  {
    "path": "training/utils/dist_checkpoint_utils.py",
    "chars": 3847,
    "preview": "import os\nimport time\nimport random\nimport json\nimport numpy as np\nimport torch\n\nfrom comm.comm_utils import *\n\n\ndef loa"
  },
  {
    "path": "training/utils/dist_debug_utils.py",
    "chars": 794,
    "preview": "import torch\n\n\ndef print_cuda_memory(args, info: str, device=None):\n    if args.debug_mem:\n        if device is None:\n  "
  },
  {
    "path": "training/utils/event_report.py",
    "chars": 11138,
    "preview": "#!/usr/bin/env python3\n\n# This application reports events that are stored in the event log REST service.\n# Events will b"
  },
  {
    "path": "training/utils/logging_utils.py",
    "chars": 1336,
    "preview": "import os\n\ntry:\n    import wandb\n    _has_wandb = True\nexcept:\n    _has_wandb = False\n    print(\"wandb is not installed."
  },
  {
    "path": "training/utils/upload_manager.py",
    "chars": 11853,
    "preview": "import argparse\nimport boto3\nimport concurrent.futures\nimport os\nimport re\nimport sys\nimport time\n\nfrom utils.event_repo"
  }
]

About this extraction

This page contains the full source code of the togethercomputer/OpenChatKit GitHub repository, extracted and formatted as plain text. The extraction includes 83 files (440.4 KB), approximately 104.5k tokens, and a symbol index with 475 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract — free GitHub repo to text converter for AI. Built by Nikandr Surkov.
