Repository: volcengine/veGiantModel
Branch: main
Commit: a5cd7f006fc1
Files: 22
Total size: 223.2 KB
Directory structure:
gitextract_cagnoq27/
├── .gitignore
├── .gitmodules
├── LICENSE
├── README.md
├── docs/
│ ├── Dockerfile
│ └── step-by-step-tutorial.md
├── examples/
│ └── gpt/
│ ├── gpt_piped.py
│ ├── initialize.py
│ ├── pretrain_gpt2.py
│ └── pretrain_gpt2_distributed.sh
└── src/
└── veGiantModel/
├── __init__.py
├── distributed/
│ └── __init__.py
├── engine/
│ ├── engine.py
│ ├── module.py
│ ├── p2p.py
│ ├── schedule.py
│ └── topology.py
├── initialize.py
├── launcher/
│ └── launch.py
├── module/
│ ├── __init__.py
│ └── dense.py
└── patcher.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
application/cache
*.pyc
# general things to ignore
build/
dist/
*.egg-info/
*.egg
*.py[cod]
__pycache__/
*~
# due to using tox and pytest
.tox
.cache
================================================
FILE: .gitmodules
================================================
[submodule "third_party/megatron"]
path = third_party/megatron
url = https://github.com/NVIDIA/Megatron-LM.git
[submodule "third_party/deepspeed"]
path = third_party/deepspeed
url = https://github.com/microsoft/DeepSpeed.git
================================================
FILE: LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: README.md
================================================
# veGiantModel
veGiantModel is a PyTorch-based, high-efficiency training library developed by the Applied Machine Learning team at ByteDance. This repository hosts ongoing research to make training giant models (such as [GPT](https://arxiv.org/abs/2005.14165), [BERT](https://arxiv.org/pdf/1810.04805.pdf) and [T5](https://arxiv.org/abs/1910.10683)) easy, efficient, and effective. veGiantModel builds on top of [Megatron](https://github.com/NVIDIA/Megatron-LM) and [DeepSpeed](https://github.com/microsoft/DeepSpeed), and improves communication efficiency by integrating the high-performance communication library [BytePS](https://github.com/bytedance/byteps) and providing customized pipeline partitioning.
## Initialization
```python
import veGiantModel
pipeline_parallel_size = 1
model_parallel_size = 2
veGiantModel.initialize.init_distribute(pipeline_parallel_size, model_parallel_size, init_method="env://")
mp_size = veGiantModel.distributed.get_model_parallel_world_size()
dp_size = veGiantModel.distributed.get_data_parallel_world_size()
```
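`init_distribute` partitions the launched ranks into pipeline-, model-, and data-parallel groups. As a quick sanity check, the three sizes should multiply out to the world size; a minimal sketch, assuming every rank belongs to exactly one group of each kind:
```python
import torch.distributed as dist

# Assumed invariant: data_parallel_size = world_size / (pipeline_parallel_size * model_parallel_size).
world_size = dist.get_world_size()
assert world_size == pipeline_parallel_size * mp_size * dp_size, \
    "parallel group sizes should exactly partition the launched ranks"
```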
## Modules
```python
import torch
import torch.nn as nn

from veGiantModel.module import ColumnParallelLinear, RowParallelLinear

# `Config` and `Activation` below are user-defined helpers from the surrounding codebase.
class PositionWiseFeedForward(nn.Module):
    """Feed-forward network applied to each position."""
    def __init__(self, config: Config):
        super().__init__()
        self.config = config
        if self.config.use_mp_linear_in_ffn:
            assert ColumnParallelLinear is not None
            assert RowParallelLinear is not None
            self.fc1 = ColumnParallelLinear(config.dim, config.dim_ff, use_ft=False)
            self.fc2 = RowParallelLinear(config.dim_ff, config.dim, use_ft=False)
        else:
            self.fc1 = nn.Linear(config.dim, config.dim_ff)
            self.fc2 = nn.Linear(config.dim_ff, config.dim)
        self.act = Activation(config.act)
        self.dropout = nn.Dropout(config.p_drop_hidden)

    def forward(self, x) -> torch.Tensor:
        # (bsz, seq_len, dim) -> (bsz, seq_len, dim_ff / model_parallel_size) -> (bsz, seq_len, dim)
        fc1_out = self.act(self.fc1(x))
        if self.config.dropout_in_ffn:
            fc1_out = self.dropout(fc1_out)
        fc2_out = self.fc2(fc1_out)
        if self.config.use_ffn_output_dropout:
            fc2_out = self.dropout(fc2_out)
        return fc2_out
```
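Under the Megatron-style sharding these wrappers follow, `ColumnParallelLinear` splits the output features, so each model-parallel rank holds a `dim_ff / model_parallel_size` slice of `fc1`, while `RowParallelLinear` splits the input features of `fc2` and all-reduces its partial results. Pairing them this way keeps the intermediate activation sharded across ranks and costs a single all-reduce per feed-forward block in the forward pass.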
## Examples
### GPT Pretraining
The `examples/gpt/pretrain_gpt2_distributed.sh` script runs 345M-parameter GPT pretraining on a single node with 8 GPUs. It largely follows the Megatron GPT script, with a few notable differences, and shows that an existing Megatron/DeepSpeed training job can adopt veGiantModel with only small changes.
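At its core, the example driver (`examples/gpt/pretrain_gpt2.py`) builds the piped GPT model, wraps it into a veGiantModel engine, and repeatedly calls `train_batch`. The sketch below is illustrative only: it omits the dataset, logging, and checkpointing setup, and `data_iter` is a placeholder for the GPT data iterator the real script builds.
```python
# Illustrative sketch of the flow in examples/gpt/pretrain_gpt2.py (not a drop-in script).
from initialize import initialize_megatron, initialize_pipeline
from gpt_piped import GPTModelPiped
from megatron import get_args

initialize_megatron(args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
model = GPTModelPiped()
engine, optimizer, lr_scheduler = initialize_pipeline(model, None, None)

args = get_args()
for _ in range(args.train_iters):
    engine.train_batch(data_iter)  # data_iter: iterator over the preprocessed GPT dataset
```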
================================================
FILE: docs/Dockerfile
================================================
FROM nvcr.io/nvidia/pytorch:21.05-py3
RUN pip3 install boto3 regex tensorboardX==1.8 wheel pybind11 ninja psutil pyprof
RUN apt-get -yq autoremove --purge ibverbs-providers
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
libibverbs-dev=28.0-1ubuntu1 libibverbs1=28.0-1ubuntu1
RUN apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install -yq --no-install-recommends --allow-downgrades \
cmake \
libopenmpi-dev \
openmpi-bin \
openssh-client \
openssh-server \
ibverbs-providers \
libibverbs-dev=28.0-1ubuntu1 \
librdmacm-dev \
vim \
iputils-ping \
llvm-10-dev \
iproute2 \
unzip
RUN ln -s /usr/bin/aclocal-1.16 /usr/local/bin/aclocal-1.14
RUN ln -s /usr/bin/automake /usr/local/bin/automake-1.14
ENV LD_LIBRARY_PATH "/usr/lib/x86_64-linux-gnu:${LD_LIBRARY_PATH}"
ENV BYTEPS_WITH_UCX 0
# Install BytePS from a pre-built package stored in TOS on Volcengine
# RUN pip3 install https://giant-model-package.tos-cn-beijing.volces.com/byteps-0.7.2-cp38-cp38-linux_x86_64.whl
# Install BytePS from source
RUN git clone --recursive -b bccl-github https://github.com/bytedance/byteps.git && \
cd byteps && python3 setup.py install
WORKDIR /root
================================================
FILE: docs/step-by-step-tutorial.md
================================================
# A Step-by-Step Tutorial
The goal of this tutorial is to help you run the example quickly.
## Prerequisites
PyTorch:
```
pip3 install torch
```
Apex:
```
git clone https://github.com/NVIDIA/apex.git
cd apex
python3 setup.py -v --cpp_ext --cuda_ext bdist_wheel
sudo pip3 install dist/*
```
BytePS:
```
git clone --recursive -b bccl-github https://github.com/bytedance/byteps.git
cd byteps
python3 setup.py install
```
## Prepare data
[GPT data preprocess](https://github.com/NVIDIA/Megatron-LM#data-preprocessing)
## Set up veGiantModel
```
git clone https://github.com/volcengine/veGiantModel.git
cd veGiantModel
git submodule update --init --recursive
```
## Modify script
Modify `examples/gpt/pretrain_gpt2_distributed.sh` before running it. The variables to set are listed below, with an example set of values after the list:
```
DATA_PATH -- local folder path of the preprocessed GPT data
CHECKPOINT_PATH -- local path for saving/loading checkpoints
MASTER_PORT -- port number used by torch DDP
WORKER_0_PORT -- port number used by veGiantModel for communication
WORKER_0_HOST -- IP of the master node (single-node training can use 'localhost')
NUM_WORKER -- number of worker nodes in the training
WORKER_RANK -- rank of the current node
GPU_PER_WORKER -- number of GPUs per node
```
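For example, a single-node run on 8 GPUs could use values like the ones below; the two paths are placeholders that must point at your own data and checkpoint directories, and the remaining values mirror the defaults already in the script.
```
DATA_PATH=/path/to/preprocessed/gpt/data    # placeholder
CHECKPOINT_PATH=/path/to/checkpoints        # placeholder
MASTER_PORT=6002
export WORKER_0_HOST=localhost
export WORKER_0_PORT=6000
export NUM_WORKER=1
export WORKER_RANK=0
export GPU_PER_WORKER=8
```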
## Run script
```
bash examples/gpt/pretrain_gpt2_distributed.sh
```
================================================
FILE: examples/gpt/gpt_piped.py
================================================
import torch
from megatron import get_args, mpu
from megatron.model.language_model import parallel_lm_logits, Embedding
from megatron.model.transformer import ParallelTransformerLayer
from megatron.model.transformer import LayerNorm
from megatron.model.gpt2_model import gpt2_attention_mask_func
from megatron.model.utils import init_method_normal
from megatron.model.utils import scaled_init_method_normal
from megatron.module import MegatronModule
from megatron.utils import get_ltor_masks_and_position_ids
from deepspeed.pipe import LayerSpec, TiedLayerSpec
from megatron import get_tokenizer
from veGiantModel.engine.module import VeGiantModule
import veGiantModel
class GPTModelPiped(VeGiantModule):
def __init__(self):
args = get_args()
self.fp16_lm_cross_entropy = args.fp16_lm_cross_entropy
self.tokenizer = get_tokenizer()
self.parallel_output = True
self.num_layers = args.num_layers
self.hidden_size = args.hidden_size
self.init_method = init_method_normal(args.init_method_std)
self.scale_init_method = scaled_init_method_normal(args.init_method_std,
args.num_layers)
self.num_tokentypes = 0
layers = []
layers.append(lambda x: self._get_batch(x))
layers.append(TiedLayerSpec("SharedEmbedding",
EmbeddingPiped,
self.hidden_size,
args.padded_vocab_size,
args.max_position_embeddings,
args.hidden_dropout,
self.init_method,
self.num_tokentypes,
tied_weight_attr='embedding_weight'))
layers.append(lambda x: (x[0].transpose(0, 1).contiguous(), x[1]))
for i in range(self.num_layers):
layers.append(LayerSpec(ParallelTransformerLayerPiped,
gpt2_attention_mask_func,
self.init_method,
self.scale_init_method,
i+1))
layers.append(lambda x: (x[0].transpose(0, 1).contiguous()))
layers.append(LayerSpec(LayerNorm, args.hidden_size, eps=args.layernorm_epsilon))
layers.append(TiedLayerSpec("SharedEmbedding",
LMLogitsPiped,
self.hidden_size,
args.padded_vocab_size,
self.init_method,
tied_weight_attr='embedding_weight'))
super().__init__(layers=layers,
num_stages = args.num_stages,
partition_method=args.partition_method,
grid=veGiantModel.distributed.get_grid(),
loss_fn=self.loss_fn)
# Data Preprocessing, copied from pretrain_gpt2.py
def _get_batch(self, data):
"""Generate a batch"""
args = get_args()
# Unpack.
tokens = data
attention_mask, _, position_ids = get_ltor_masks_and_position_ids(
tokens,
self.tokenizer.eod,
args.reset_position_ids,
args.reset_attention_mask,
args.eod_mask_loss)
return (tokens.to(device="cuda"),
position_ids.to(device="cuda"),
attention_mask.to(device="cuda"))
def loss_fn(self, inputs, data):
tokens = data[0]
target = data[1]
args = get_args()
_, loss_mask, _ = get_ltor_masks_and_position_ids(
tokens,
self.tokenizer.eod,
args.reset_position_ids,
args.reset_attention_mask,
args.eod_mask_loss)
if self.fp16_lm_cross_entropy:
assert inputs.dtype == torch.half
loss = mpu.vocab_parallel_cross_entropy(inputs, target)
else:
loss = mpu.vocab_parallel_cross_entropy(inputs.float(), target)
loss_mask = loss_mask.view(-1)
loss_avg = torch.sum(loss.view(-1) * loss_mask) / loss_mask.sum()
if loss.dtype == torch.half:
loss_avg = loss_avg.half()
return loss_avg
def batch_fn(self, batch, is_train:bool):
if batch is not None:
data = {'text': torch.tensor(batch['text'].numpy())}
else:
data = None
keys = ['text']
datatype = torch.int64
data_b = mpu.broadcast_data(keys, data, datatype)
tokens_ = data_b['text'].long()
tokens_write = tokens_
labels = tokens_[:, 1:].contiguous()
tokens_ = tokens_[:, :-1].contiguous()
tokens_2 = torch.unsqueeze(tokens_, 0)
data2 = torch.cat((tokens_2, labels[None, :, :]), dim=0)
data = []
data.append(tokens_)
data.append(data2)
return data
class LMLogitsPiped(MegatronModule):
def __init__(self, hidden_size, vocab_size, init_method):
super().__init__()
self.word_embeddings = mpu.VocabParallelEmbedding(
vocab_size, hidden_size, init_method=init_method)
self.embedding_weight = self.word_embeddings.weight
def forward(self, lm_output):
return parallel_lm_logits(lm_output, self.embedding_weight, True)
class EmbeddingPiped(Embedding):
def __init__(self,
hidden_size,
vocab_size,
max_sequence_length,
embedding_dropout_prob,
init_method,
num_tokentypes=0):
super().__init__(hidden_size,
vocab_size,
max_sequence_length,
embedding_dropout_prob,
init_method,
num_tokentypes)
self.embedding_weight = self.word_embeddings.weight
def forward(self, inputs):
input_ids, position_ids, attention_mask = inputs
return super().forward(input_ids, position_ids, None), attention_mask
class ParallelTransformerLayerPiped(ParallelTransformerLayer):
def __init__(self,
attention_mask_func,
init_method,
output_layer_init_method,
layer_number):
super().__init__(attention_mask_func,
init_method,
output_layer_init_method,
layer_number)
def forward(self, inputs):
hidden_states, attention_mask = inputs
return (super().forward(hidden_states, attention_mask),
attention_mask)
================================================
FILE: examples/gpt/initialize.py
================================================
import torch
import json
import veGiantModel
from megatron import get_args, mpu
from megatron.fp16 import FP16_Module, FP16_Optimizer  # FP16_Optimizer is used in get_optimizer when not running under DeepSpeed
from torch.nn.parallel.distributed import DistributedDataParallel as torchDDP
from megatron.model import DistributedDataParallel as LocalDDP
from megatron.model import get_params_for_weight_decay_optimization
from apex.optimizers import FusedAdam as Adam
from megatron.learning_rates import AnnealingLR
from megatron import print_rank_0
def get_learning_rate_scheduler(optimizer, lr_scheduler_builder):
"""Build the learning rate scheduler."""
args = get_args()
if lr_scheduler_builder is not None:
lr_scheduler = lr_scheduler_builder(optimizer)
else:
# Add linear learning rate scheduler.
if args.lr_decay_iters is not None:
num_iters = args.lr_decay_iters
else:
num_iters = args.train_iters
num_iters = max(1, num_iters)
init_step = 0
warmup_iter = args.warmup * num_iters
lr_scheduler = AnnealingLR(
optimizer,
start_lr=args.lr,
warmup_iter=warmup_iter,
total_iters=num_iters,
decay_style=args.lr_decay_style,
last_iter=init_step,
min_lr=args.min_lr,
use_checkpoint_lr_scheduler=args.use_checkpoint_lr_scheduler,
override_lr_scheduler=args.override_lr_scheduler)
return lr_scheduler
def get_model(model_provider_func):
"""Build the model."""
args = get_args()
# Build model on cpu.
model = model_provider_func()
# Print number of parameters.
if mpu.get_data_parallel_rank() == 0:
print(' > number of parameters on model parallel rank {}: {}'.format(
mpu.get_model_parallel_rank(),
sum([p.nelement() for p in model.parameters()])), flush=True)
# GPU allocation.
model.cuda(torch.cuda.current_device())
return model
def get_optimizer(model):
"""Set up the optimizer."""
args = get_args()
# Build parameter groups (weight decay and non-decay).
while isinstance(model, (torchDDP, LocalDDP, FP16_Module)):
model = model.module
param_groups = get_params_for_weight_decay_optimization(model)
# Add model parallel attribute if it is not set.
for param_group in param_groups:
for param in param_group['params']:
if not hasattr(param, 'model_parallel'):
param.model_parallel = False
if args.cpu_optimizer:
if args.cpu_torch_adam:
cpu_adam_optimizer = torch.optim.Adam
else:
from deepspeed.ops.adam import DeepSpeedCPUAdam
cpu_adam_optimizer = DeepSpeedCPUAdam
optimizer = cpu_adam_optimizer(param_groups,
lr=args.lr, weight_decay=args.weight_decay)
else:
# Use Adam.
optimizer = Adam(param_groups, lr=args.lr, weight_decay=args.weight_decay)
if args.deepspeed:
# fp16 wrapper is not required for DeepSpeed.
return optimizer
# Wrap into fp16 optimizer.
if args.fp16:
optimizer = FP16_Optimizer(optimizer,
static_loss_scale=args.loss_scale,
dynamic_loss_scale=args.dynamic_loss_scale,
dynamic_loss_args={
'scale_window': args.loss_scale_window,
'min_scale': args.min_scale,
'delayed_shift': args.hysteresis},
fp16_optim=args.fp16_optim)
return optimizer
def setup_model_and_optimizer(model, optimizer, train_dataset_provider, lr_scheduler_builder):
"""Setup model and optimizer."""
args = get_args()
if optimizer is None:
optimizer = get_optimizer(model)
lr_scheduler = get_learning_rate_scheduler(optimizer, lr_scheduler_builder)
print_rank_0("DeepSpeed is enabled.")
# Print number of parameters.
if mpu.get_data_parallel_rank() == 0:
print(' > number of parameters on data parallel rank {}, model parallel rank {}, pipeline parallel rank {}: {}'.format(
mpu.get_data_parallel_rank(),
mpu.get_model_parallel_rank(),
mpu.get_pipe_parallel_rank(),
sum([p.nelement() for p in model.parameters()])), flush=True)
if args.deepspeed_pipeline:
print_rank_0("Pipeline Parallelism is enabled.")
train_data = train_dataset_provider() if train_dataset_provider is not None else None
_param_dict = json.loads(args.config_param)
engine, optimizer, _, lr_scheduler = veGiantModel.initialize(
model=model,
optimizer=optimizer,
args=args,
lr_scheduler=lr_scheduler,
mpu=None,
dist_init_required=False,
config_params = _param_dict,
training_data=train_data
)
engine.set_batch_fn(model.batch_fn)
else:
engine, optimizer, _, lr_scheduler = veGiantModel.initialize(
model=model,
optimizer=optimizer,
args=args,
lr_scheduler=lr_scheduler,
mpu=mpu,
dist_init_required=False
)
print_rank_0("Model Preparation Done")
args.iteration = 0
return engine, optimizer, lr_scheduler
def initialize_pipeline(model, optimizer, train_dataset_provider, lr_scheduler_builder=None):
return setup_model_and_optimizer(model, optimizer, train_dataset_provider, lr_scheduler_builder)
def initialize_distributed(num_stages, mp_size, distributed_backend='nccl'):
veGiantModel.init_distribute(num_stages=num_stages, mp_size=mp_size, distributed_backend=distributed_backend)
def initialize_megatron(extra_args_provider=None, args_defaults={}):
veGiantModel.initialize_megatron(extra_args_provider=extra_args_provider, args_defaults=args_defaults)
================================================
FILE: examples/gpt/pretrain_gpt2.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
"""Pretrain GPT2"""
import torch
import os
import numpy as np
import time
import sys
_cwd = os.path.dirname(os.path.abspath(__file__))
_giantModel_dir = os.path.join(_cwd, '../../src')
sys.path.append(_giantModel_dir)
from initialize import initialize_megatron, initialize_pipeline
from gpt_piped import GPTModelPiped
from megatron import get_args, mpu
from megatron import get_timers
from megatron import get_tensorboard_writer
from megatron import print_rank_0
from megatron.learning_rates import AnnealingLR
from megatron.training import build_train_valid_test_data_iterators
from megatron.data.gpt2_dataset import get_indexed_dataset_, get_train_valid_test_split_, _num_tokens, _num_epochs, _build_doc_idx, _build_shuffle_idx
from deepspeed.utils import log_dist
def _build_index_mappings(name, data_prefix, documents, sizes,
num_samples, seq_length, seed):
"""Build doc-idx, sample-idx, and shuffle-idx.
doc-idx: is an array (ordered) of documents to be used in training.
sample-idx: is the start document index and document offset for each
training sample.
shuffle-idx: maps the sample index into a random index into sample-idx.
"""
log_dist(f' >>>> Entering _build_index_mappings', ranks=[-1])
# Number of tokens in each epoch and number of required epochs.
args = get_args()
tokens_per_epoch = _num_tokens(documents, sizes)
num_epochs = _num_epochs(tokens_per_epoch, seq_length, num_samples)
# rng state
np_rng = np.random.RandomState(seed=seed)
# Filename of the index mappings.
_filename = data_prefix
_filename += '_{}_{}_indexmap'.format(args.rank, name)
_filename += '_{}ns'.format(num_samples)
_filename += '_{}sl'.format(seq_length)
_filename += '_{}s'.format(seed)
doc_idx_filename = _filename + '_doc_idx.npy'
sample_idx_filename = _filename + '_sample_idx.npy'
shuffle_idx_filename = _filename + '_shuffle_idx.npy'
# Build the indexed mapping if not exist.
device_count = torch.cuda.device_count()
if (not os.path.isfile(doc_idx_filename)) or \
(not os.path.isfile(sample_idx_filename)) or \
(not os.path.isfile(shuffle_idx_filename)):
log_dist(f' > WARNING: could not find index map files, building '
'the indices ...', ranks=[-1])
# doc-idx.
start_time = time.time()
doc_idx = _build_doc_idx(documents, num_epochs, np_rng)
np.save(doc_idx_filename, doc_idx, allow_pickle=True)
log_dist(' > elapsed time to build and save doc-idx mapping '
'(seconds): {:4f}'.format(time.time() - start_time), ranks=[-1])
# sample-idx.
start_time = time.time()
# Use C++ implementation for speed.
# First compile and then import.
from megatron.data.dataset_utils import compile_helper
compile_helper()
from megatron.data import helpers
assert doc_idx.dtype == np.int32
assert sizes.dtype == np.int32
sample_idx = helpers.build_sample_idx(sizes, doc_idx, seq_length,
num_epochs, tokens_per_epoch)
# sample_idx = _build_sample_idx(sizes, doc_idx, seq_length,
# num_epochs, tokens_per_epoch)
np.save(sample_idx_filename, sample_idx, allow_pickle=True)
log_dist(' > elapsed time to build and save sample-idx mapping '
'(seconds): {:4f}'.format(time.time() - start_time), ranks=[-1])
# shuffle-idx.
start_time = time.time()
# -1 is due to the data structure used to retrieve the index:
# sample i --> [sample_idx[i], sample_idx[i+1])
shuffle_idx = _build_shuffle_idx(sample_idx.shape[0] - 1, np_rng)
np.save(shuffle_idx_filename, shuffle_idx, allow_pickle=True)
log_dist(' > elapsed time to build and save shuffle-idx mapping'
' (seconds): {:4f}'.format(time.time() - start_time), ranks=[-1])
# This should be a barrier but nccl barrier assumes
# device_index=rank which is not the case for model
# parallel case
counts = torch.cuda.LongTensor([1])
torch.distributed.all_reduce(counts, group=mpu.get_data_parallel_group())
assert counts[0].item() == torch.distributed.get_world_size(
group=mpu.get_data_parallel_group())
# Load mappings.
start_time = time.time()
log_dist(' > loading doc-idx mapping from {}'.format(
doc_idx_filename))
if not os.path.isfile(doc_idx_filename):
log_dist(' > loading doc-idx mapping from {} failed, file does not exist'.format(
doc_idx_filename), ranks=[-1])
doc_idx = np.load(doc_idx_filename, allow_pickle=True, mmap_mode='r')
log_dist(' > loading sample-idx mapping from {}'.format(
sample_idx_filename), ranks=[-1])
if not os.path.isfile(sample_idx_filename):
log_dist(' > loading sample-idx mapping from {} failed, file does not exist'.format(
sample_idx_filename), ranks=[-1])
sample_idx = np.load(sample_idx_filename, allow_pickle=True, mmap_mode='r')
log_dist(' > loading shuffle-idx mapping from {}'.format(
shuffle_idx_filename), ranks=[-1])
if not os.path.isfile(shuffle_idx_filename):
log_dist(' > loading shuffle-idx mapping from {} failed, file does not exist'.format(
shuffle_idx_filename), ranks=[-1])
shuffle_idx = np.load(shuffle_idx_filename, allow_pickle=True, mmap_mode='r')
log_dist(' loaded indexed file in {:3.3f} seconds'.format(
time.time() - start_time), ranks=[-1])
log_dist(' total number of samples: {}'.format(
sample_idx.shape[0]), ranks=[-1])
log_dist(' total number of epochs: {}'.format(num_epochs), ranks=[-1])
log_dist(f' >>>> exiting _build_index_mappings', ranks=[-1])
return doc_idx, sample_idx, shuffle_idx
class GPT2DatasetFixed(torch.utils.data.Dataset):
def __init__(self, name, data_prefix, documents, indexed_dataset,
num_samples, seq_length, seed):
self.name = name
self.indexed_dataset = indexed_dataset
# Checks
assert np.min(documents) >= 0
assert np.max(documents) < indexed_dataset.sizes.shape[0]
# Build index mappings.
self.doc_idx, self.sample_idx, self.shuffle_idx = _build_index_mappings(
self.name, data_prefix, documents, self.indexed_dataset.sizes,
num_samples, seq_length, seed)
def __len__(self):
# -1 is due to the data structure used to retrieve the index:
# sample i --> [sample_idx[i], sample_idx[i+1])
return self.sample_idx.shape[0] - 1
def __getitem__(self, idx):
# Get the shuffled index.
idx = self.shuffle_idx[idx]
# Start and end documents and offsets.
doc_index_f = self.sample_idx[idx][0]
doc_index_l = self.sample_idx[idx + 1][0]
offset_f = self.sample_idx[idx][1]
offset_l = self.sample_idx[idx + 1][1]
# If we are within the same document, just extract the chunk.
if doc_index_f == doc_index_l:
sample = self.indexed_dataset.get(self.doc_idx[doc_index_f],
offset=offset_f,
length=offset_l - offset_f + 1)
else:
# Otherwise, get the rest of the initial document.
sample_list = [self.indexed_dataset.get(self.doc_idx[doc_index_f],
offset=offset_f)]
# Loop over all in between documents and add the entire document.
for i in range(doc_index_f + 1, doc_index_l):
sample_list.append(self.indexed_dataset.get(self.doc_idx[i]))
# And finally add the relevant portion of last document.
sample_list.append(self.indexed_dataset.get(
self.doc_idx[doc_index_l],
length=offset_l + 1))
sample = np.concatenate(sample_list)
return {'text': np.array(sample, dtype=np.int64)}
def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
train_valid_test_num_samples,
seq_length, seed, skip_warmup):
"""Build train, valid, and test datasets."""
# Indexed dataset.
indexed_dataset = get_indexed_dataset_(data_prefix,
data_impl,
skip_warmup)
total_num_of_documents = indexed_dataset.sizes.shape[0]
splits = get_train_valid_test_split_(splits_string, total_num_of_documents)
# Print stats about the splits.
print_rank_0(' > dataset split:')
def print_split_stats(name, index):
print_rank_0(' {}:'.format(name))
print_rank_0(' document indices in [{}, {}) total of {} '
'documents'.format(splits[index], splits[index + 1],
splits[index + 1] - splits[index]))
print_split_stats('train', 0)
print_split_stats('validation', 1)
print_split_stats('test', 2)
def build_dataset(index, name):
dataset = None
if splits[index + 1] > splits[index]:
documents = np.arange(start=splits[index], stop=splits[index + 1],
step=1, dtype=np.int32)
dataset = GPT2DatasetFixed(name, data_prefix,
documents, indexed_dataset,
train_valid_test_num_samples[index],
seq_length, seed)
return dataset
train_dataset = build_dataset(0, 'train')
valid_dataset = build_dataset(1, 'valid')
test_dataset = build_dataset(2, 'test')
return (train_dataset, valid_dataset, test_dataset)
def model_provider():
"""Build the model."""
print_rank_0('building GPT2 model ...')
model = GPTModelPiped()
return model
def lr_scheduler_builder(optimizer):
"""Build the learning rate scheduler."""
args = get_args()
# Add linear learning rate scheduler.
if args.lr_decay_iters is not None:
num_iters = args.lr_decay_iters
else:
num_iters = args.train_iters
num_iters = max(1, num_iters)
init_step = 0
warmup_iter = args.warmup * num_iters
lr_scheduler = AnnealingLR(
optimizer,
start_lr=args.lr,
warmup_iter=warmup_iter,
total_iters=num_iters,
decay_style=args.lr_decay_style,
last_iter=init_step,
min_lr=args.min_lr,
use_checkpoint_lr_scheduler=args.use_checkpoint_lr_scheduler,
override_lr_scheduler=args.override_lr_scheduler)
return lr_scheduler
def pretrain(model_provider, args_defaults={}):
initialize_megatron(args_defaults=args_defaults)
timers = get_timers()
# Model, optimizer, and learning rate.
timers('model and optimizer').start()
model = model_provider()
engine, optimizer, lr_scheduler = initialize_pipeline(model, None, None, lr_scheduler_builder)
timers('model and optimizer').stop()
# Print setup timing.
print_rank_0('done with setups ...')
print_rank_0('training ...')
train(engine, optimizer, lr_scheduler)
def training_log(loss_dict, iteration):
args = get_args()
timers = get_timers()
writer = get_tensorboard_writer()
# Logging.
timers_to_log = []
def add_to_logging(name):
if name in timers.timers:
timers_to_log.append(name)
add_to_logging('forward')
add_to_logging('backward')
add_to_logging('backward-backward')
add_to_logging('backward-allreduce')
add_to_logging('backward-master-grad')
add_to_logging('backward-clip-grad')
add_to_logging('optimizer')
add_to_logging('batch generator')
if writer and torch.distributed.get_rank() == 0:
writer.add_scalar('loss', loss_dict, iteration)
normalizer = iteration % args.log_interval
if normalizer == 0:
normalizer = args.log_interval
timers.write(timers_to_log, writer, iteration,
normalizer=normalizer)
def train_valid_test_dataset_provider(train_val_test_num_samples):
"""Build train, valid, and test datasets."""
args = get_args()
print_rank_0('> building train, validation, and test datasets '
'for GPT ...')
train_ds, valid_ds, test_ds = build_train_valid_test_datasets(
data_prefix=args.data_path,
data_impl=args.data_impl,
splits_string=args.split,
train_valid_test_num_samples=train_val_test_num_samples,
seq_length=args.seq_length,
seed=args.seed,
skip_warmup=(not args.mmap_warmup))
print_rank_0("> finished creating GPT datasets ...")
return train_ds, valid_ds, test_ds
def train(engine, optimizer, lr_scheduler):
"""Train the model function."""
args = get_args()
timers = get_timers()
# Turn on training mode which enables dropout.
engine.train()
# Iterations.
iteration = args.iteration
timers('interval time').start()
train_data_iterator, valid_data_iterator, test_data_iterator \
= build_train_valid_test_data_iterators(train_valid_test_dataset_provider)
log_dist(f' >>>> start training', ranks=[-1])
while iteration < args.train_iters:
engine.train_batch(train_data_iterator)
if __name__ == "__main__":
pretrain(model_provider,
args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})
================================================
FILE: examples/gpt/pretrain_gpt2_distributed.sh
================================================
#! /bin/bash
# Runs the "345M" parameter model
DATA_PATH=<Specify path where >
CHECKPOINT_PATH=<Specify path>
export WORKER_0_HOST=127.0.0.1
export DMLC_NODE_HOST=127.0.0.1
export WORKER_0_PORT=6000
export NUM_WORKER=1
export WORKER_RANK=0
export GPU_PER_WORKER=8
export BYTEPS_WITH_UCX=0
export DMLC_ENABLE_UCX=0
export DMLC_ENABLE_RDMA=0
MASTER_PORT=6002
MASTER_ADDR=$WORKER_0_HOST
GPUS_PER_NODE=$GPU_PER_WORKER
NNODES=$NUM_WORKER
NODE_RANK=$WORKER_RANK
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
base_dir=$(cd `dirname $0`; pwd)
echo base_dir $base_dir
DISTRIBUTED_ARGS="--nproc_per_node $GPUS_PER_NODE --nnodes $NNODES --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT"
ds_config='{
"train_micro_batch_size_per_gpu":16,
"train_batch_size" : 16,
"gradient_accumulation_steps": 2,
"steps_per_print": 1,
"gradient_clipping": 1.0,
"zero_optimization": {
"stage": 0,
"allgather_partitions": true,
"allgather_bucket_size": 500000000,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 500000000,
"contiguous_gradients" : true,
"cpu_offload": false
},
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},
"wall_clock_breakdown": true
}'
python3 -m torch.distributed.launch $DISTRIBUTED_ARGS \
--no_python --use_env python3 \
${base_dir}/pretrain_gpt2.py \
--model-parallel-size 2 \
--num-stages 2 \
--num-layers 24 \
--hidden-size 1024 \
--train-batch-size 64 \
--gradient_accumulation_steps 16 \
--num-attention-heads 16 \
--batch-size 4 \
--seq-length 1024 \
--max-position-embeddings 1024 \
--train-iters 500000 \
--lr-decay-iters 450000 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--data-path $DATA_PATH/openwebtext-gpt2_text_document \
--vocab-file $DATA_PATH/gpt2-vocab.json \
--merge-file $DATA_PATH/gpt2-merges.txt \
--data-impl mmap \
--split 949,50,1 \
--distributed-backend nccl \
--lr 0.00025 \
--lr-decay-style cosine \
--min-lr 1.0e-5 \
--weight-decay 1e-2 \
--clip-grad 1.0 \
--warmup .02 \
--log-interval 1 \
--save-interval 100000 \
--vocab-size 145608 \
--DDP-impl torch \
--eod-mask-loss \
--deepspeed-pipeline \
--deepspeed \
--config_param "$ds_config" \
--fp16 \
--partition_method "type:ParallelTransformerLayerPiped" \
$@
set +x
================================================
FILE: src/veGiantModel/__init__.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
import sys
import os
cwd = os.path.dirname(os.path.abspath(__file__))
_deepspeed_dir = os.path.join(cwd, '../../third_party/deepspeed')
_megatron_dir = os.path.join(cwd, '../../third_party/megatron')
sys.path.append(cwd)
sys.path.append(_deepspeed_dir)
sys.path.append(_megatron_dir)
from . import patcher
from .engine.engine import VeGiantModelEngine
from .initialize import initialize_megatron, init_distribute
from .distributed import *
def initialize(args,
model,
optimizer=None,
model_parameters=None,
training_data=None,
lr_scheduler=None,
mpu=None,
dist_init_required=None,
collate_fn=None,
config_params=None):
engine = VeGiantModelEngine(args=args,
model=model,
optimizer=optimizer,
model_parameters=model_parameters,
training_data=training_data,
lr_scheduler=lr_scheduler,
mpu=model.mpu(),
dist_init_required=dist_init_required,
collate_fn=collate_fn,
config_params=config_params)
return_items = [
engine,
engine.optimizer,
engine.training_dataloader,
engine.lr_scheduler
]
return tuple(return_items)
================================================
FILE: src/veGiantModel/distributed/__init__.py
================================================
from .. import patcher as dist
from megatron import mpu
def get_model_parallel_world_size():
return dist.get_model_parallel_world_size()
def get_model_parallel_rank():
return dist.get_model_parallel_rank()
def get_data_parallel_world_size():
return dist.get_data_parallel_world_size()
def get_model_parallel_group():
return dist.get_model_parallel_group()
def get_grid():
return dist.get_grid()
def copy_to_model_parallel_region(input_):
return mpu.copy_to_model_parallel_region(input_)
def reduce_from_model_parallel_region(input_):
return mpu.reduce_from_model_parallel_region(input_)
def gather_from_model_parallel_region(input_):
return mpu.gather_from_model_parallel_region(input_)
================================================
FILE: src/veGiantModel/engine/engine.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright 2019 The Microsoft DeepSpeed Team
import os
from types import MethodType
import torch
import torch.distributed as dist
from deepspeed.utils.logging import logger
from deepspeed.utils.timer import ThroughputTimer
from deepspeed.runtime.engine import MEMORY_OPT_ALLREDUCE_SIZE
from deepspeed.runtime.dataloader import RepeatingLoader
from deepspeed.runtime.pipe.module import PipelineModule, PipelineError
from deepspeed.runtime.pipe.engine import PipelineEngine
from . import p2p
from . import schedule
try:
import byteps.torch as bps
except ImportError:
print("byteps is not installed. Pipeline parallelism is disabled")
bps = None
from .module import VeGiantModule
from deepspeed.utils import log_dist
import logging
from torch._six import inf
# from inspect import signature
LOG_STAGE = -2
DATA_PARALLEL_ID = -2
try:
from apex import amp
except ImportError:
# Fail silently so we don't spam logs unnecessarily if user isn't using amp
pass
def is_even(number):
return number % 2 == 0
ENABLE_PYTORCH_BROADCAST = os.environ.get("ENABLE_PYTORCH_BROADCAST", "0") != "0"
DS_PIPE_VERBOSE = int(os.environ.get('DS_PIPE_VERBOSE', "0"))
MEGATRON_DEBUG_DATA = os.environ.get('MEGATRON_DEBUG_DATA', "0") != "0"
MEGATRON_DEBUG_GRAD = os.environ.get('MEGATRON_DEBUG_GRAD', "0") != "0"
ENABLE_BPS_PARTITION = os.environ.get("ENABLE_BPS_PARTITION", "0") != "0"
def _tensor_bytes(tensor):
return tensor.numel() * tensor.element_size()
def _dtype_to_code(dtype):
if dtype == torch.half:
return 0
elif dtype == torch.float:
return 1
elif dtype == torch.int16:
return 2
elif dtype == torch.int32:
return 3
elif dtype == torch.int64:
return 4
elif dtype == torch.bool:
return 5
else:
raise AssertionError("not recognized tensor type for pipeline send")
def _code_to_dtype(code):
if code == 0:
return torch.half
elif code == 1:
return torch.float
elif code == 2:
return torch.int16
elif code == 3:
return torch.int32
elif code == 4:
return torch.int64
elif code == 5:
return torch.bool
else:
raise AssertionError("not recognized tensor type code for pipeline recv")
class VeGiantModelEngine(PipelineEngine):
""" A training engine hybrid pipeline, data, and model parallel training.
This engine is created by ``deepspeed.initialize()`` when a :class:`PipelineModule`
is provided.
"""
def overwrite(self, config_params, args):
if args.batch_size is not None:
log_dist(f'overwrite dsconfig train_micro_batch_size_per_gpu to {args.batch_size}', \
ranks=[-1], level=logging.DEBUG)
config_params['train_micro_batch_size_per_gpu'] = args.batch_size
if args.gradient_accumulation_steps is not None:
log_dist(f'overwrite dsconfig gradient_accumulation_steps to {args.gradient_accumulation_steps}', \
ranks=[-1], level=logging.DEBUG)
config_params['gradient_accumulation_steps'] = args.gradient_accumulation_steps
if args.train_batch_size is not None:
log_dist(f'overwrite dsconfig train_batch_size to {args.train_batch_size}, ', \
ranks=[-1], level=logging.DEBUG)
config_params['train_batch_size'] = args.train_batch_size
if args.log_interval is not None:
config_params['steps_per_print'] = args.log_interval
def __init__(self, args,
model,
optimizer,
model_parameters,
training_data,
lr_scheduler,
mpu,
dist_init_required,
collate_fn,
config_params):
self.overwrite(config_params, args)
super(PipelineEngine, self).__init__(args,
model,
optimizer,
model_parameters,
training_data,
lr_scheduler,
mpu,
dist_init_required,
collate_fn,
config_params)
assert isinstance(self.module, PipelineModule), "model must be a PipelineModule"
# pipeline step for logging
self.args = args
self.log_batch_step_id = -1
self.train_mode = True
self.enable_backward_allreduce = False
self.micro_batch_size = self.train_micro_batch_size_per_gpu()
self.micro_batches = self.gradient_accumulation_steps()
self.first_train = True
self.first_eval = True
# Set Grid and Communication Groups
self.grid = self.module._grid
if self.grid.get_global_rank() == 0:
logger.info(f'CONFIG: micro_batches={self.micro_batches} '
f'micro_batch_size={self.micro_batch_size}')
self.global_rank = self.grid.get_global_rank()
assert self.dp_world_size == self.grid.data_parallel_size
assert self.train_batch_size() == \
self.micro_batch_size * self.micro_batches * self.grid.data_parallel_size
# Set Stage Info
self.num_stages = self.grid.pipe_parallel_size
self.stage_id = self.grid.get_stage_id()
self.mp_id = self.grid.get_model_parallel_id()
self.prev_stage = self.stage_id - 1
self.next_stage = self.stage_id + 1
self.data_iterator = None
self.batch_fn = None
self.result_dict = {}
self._force_grad_boundary = False
self.batch_timer = ThroughputTimer(batch_size=self.micro_batch_size *
self.micro_batches,
num_workers=self.dp_world_size,
logging_fn=self.tput_log,
monitor_memory=False,
steps_per_output=self.steps_per_print())
# PipelineEngine needs to handle data loading specially due to only the first
# and last stages loading inputs/labels. A distributed sampler over the data parallel ranks is built below when training data is provided.
if self.training_data:
self._build_data_iter(self.training_data)
self.is_pipe_parallel = self.grid.pipe_parallel_size > 1
self.is_data_parallel = self.grid.data_parallel_size > 1
self.is_model_parallel = self.grid.model_parallel_size > 1
# Partition input/output buffers
self.is_pipe_partitioned = False if self.args.broadcast_activation else (self.is_model_parallel and ENABLE_PYTORCH_BROADCAST)
self.is_grad_partitioned = False
model_parameters = filter(lambda p: p.requires_grad, self.module.parameters())
num_params = sum([p.numel() for p in model_parameters])
unique_params = num_params
# Subtract tied parameters if we don't own them
if self.module.tied_comms:
tied_params = 0
for key, d in self.module.tied_comms.items():
if self.global_rank != min(d['ranks']):
tied_params += sum(p.numel() for p in d['module'].parameters())
unique_params -= tied_params
params_tensor = torch.LongTensor(data=[num_params,
unique_params]).to(self.device)
print(f'Calculating param sizes ... ', flush=True)
dist.all_reduce(params_tensor, group=self.grid.get_model_parallel_group())
params_tensor = params_tensor.tolist()
total_params = params_tensor[0]
unique_params = params_tensor[1]
if self.grid.data_parallel_id == 0:
logger.info(f'RANK={self.global_rank} '
f'STAGE={self.stage_id} '
f'LAYERS={self.module._local_stop - self.module._local_start} '
f'[{self.module._local_start}, {self.module._local_stop}) '
f'STAGE_PARAMS={num_params} ({num_params/1e6:0.3f}M) '
f'TOTAL_PARAMS={total_params} ({total_params/1e6:0.3f}M) '
f'UNIQUE_PARAMS={unique_params} ({unique_params/1e6:0.3f}M)')
print(f'DONE calculating param sizes. Now init proc groups', flush=True)
# initialize peer-to-peer communication and allreduce groups
if self.is_pipe_parallel:
p2p.init_process_groups(self.grid)
# Pipeline buffers
self.num_pipe_buffers = 0
self.pipe_buffers = {
'inputs' : [], # batch input and received activations
'labels' : [], # labels from batch input
'outputs' : [], # activations
'output_tensors' : [], # tensor object to preserve backward graph
'bps_act_recv' : [], # activations recv
'bps_grad_recv' : [], # gradients recv
}
self.pipe_recv_buf = None
self.grad_layer = None
self.meta_buffer = None
self.first_output_send = True
self.first_gradient_send = True
#stores the loss for the current micro batch being processed
self.loss = torch.tensor(0.0).to(self.device)
self.metric = 0
#stores the loss for the entire batch
self.total_loss = None
self.agg_loss = torch.tensor(0.0, requires_grad=False).to(self.device)
self.dp_group_loss = torch.tensor(0.0, requires_grad=False).to(self.device)
if self._config.pipeline['activation_checkpoint_interval'] > 0:
self.module.activation_checkpoint_interval = self._config.pipeline[
'activation_checkpoint_interval']
if self.is_last_stage():
self.loss_model = self.module.loss_fn
log_dist(f'Initialize pipeline communicators', \
ranks=[-1], level=logging.DEBUG)
# Initialize pipeline communicators. Just send a 0.
if is_even(self.stage_id):
if not self.is_last_stage():
p2p.send(self.loss, self.next_stage)
if not self.is_first_stage():
p2p.recv(self.loss, self.prev_stage)
else:
if not self.is_first_stage():
p2p.recv(self.loss, self.prev_stage)
if not self.is_last_stage():
p2p.send(self.loss, self.next_stage)
log_dist(f'DONE Initialize pipeline communicators', \
ranks=[-1], level=logging.DEBUG)
# XXX look into timer reporting timing
# Initialize some timers because of early weirdness.
if self.wall_clock_breakdown():
self.timers('forward_microstep').start()
self.timers('forward_microstep').stop()
self.timers('backward_microstep').start()
self.timers('backward_microstep').stop()
self.timers('backward_inner_microstep').start()
self.timers('backward_inner_microstep').stop()
self.timers('backward_allreduce_microstep').start()
self.timers('backward_allreduce_microstep').stop()
self.timers('backward_allreduce').start()
self.timers('backward_allreduce').stop()
self.timers('step_microstep').start()
self.timers('step_microstep').stop()
if self.local_rank == -1:
# using the number of visible devices would be better
self.local_rank = self.global_rank % torch.cuda.device_count()
if not p2p.ENABLE_PYTORCH_BROADCAST:
gpu_per_node = int(os.environ['GPU_PER_WORKER'])
print(f'bps init worker: {gpu_per_node}, {self.local_rank}/{self.global_rank}', flush=True)
os.environ['BYTEPS_LOCAL_RANK'] = str(self.local_rank)
os.environ['BYTEPS_LOCAL_SIZE'] = str(gpu_per_node)
os.environ['BYTEPS_VISIBLE_DEVICE'] = str(self.local_rank)
os.environ['DMLC_ROLE'] = 'joint'
os.environ['DMLC_WORKER_ID'] = str(self.global_rank)
bps.init(lazy=False)
print(f'bps init DONE', flush=True)
def _profiling_func_exit(self):
torch.cuda.nvtx.range_pop()
def _profiling_func_enter(self, func):
torch.cuda.nvtx.range_push(f'stage_id: {self.stage_id}, mp_id: {self.mp_id}, fun: {func}')
def _build_data_iter(self, dataset):
if not isinstance(dataset, torch.utils.data.Dataset):
self.set_dataloader(dataset)
else:
sampler = torch.utils.data.distributed.DistributedSampler(
dataset,
num_replicas=self.dp_world_size,
rank=self.mpu.get_data_parallel_rank(),
shuffle=False)
# Build a loader and make it repeating.
pipe_dataloader = self.deepspeed_io(dataset, data_sampler=sampler)
pipe_dataloader = RepeatingLoader(pipe_dataloader)
self.set_dataloader(pipe_dataloader)
def _exec_reduce_tied_grads(self):
self._profiling_func_enter('_exec_reduce_tied_grads')
self.module.allreduce_tied_weight_gradients()
self._profiling_func_exit()
def _exec_reduce_grads(self):
self._profiling_func_enter('_exec_reduce_grads')
self._force_grad_boundary = True
if self.is_data_parallel:
self.buffered_allreduce_fallback(
elements_per_buffer=MEMORY_OPT_ALLREDUCE_SIZE)
self._force_grad_boundary = False
self._profiling_func_exit()
def _reserve_pipe_buffers(self, num_buffers):
"""Ensure that each pipeline buffer has at least ``num_buffers`` slots.
This method only reserves slots and does not allocate tensors.
Args:
num_buffers (int): The number of buffers to reserve.
"""
if self.num_pipe_buffers >= num_buffers:
return
num_added = num_buffers - self.num_pipe_buffers
for key in self.pipe_buffers:
self.pipe_buffers[key].extend([None] * num_added)
self.num_pipe_buffers = num_buffers
def train_batch(self, data_iter=None):
"""Progress the pipeline to train the next batch of data. The engine will ingest
``self.train_batch_size()`` total samples collectively across all workers.
An iterator over training data should be provided as an argument
unless ``deepspeed.initialize()`` was provided a training set. In that event,
the training data will automatically be read.
.. warning::
A total of ``self.gradient_accumulation_steps()`` entries will be pulled
from ``data_iter`` by each pipeline. There must be sufficient
data left in ``data_iter`` or else a ``StopIteration`` will halt training.
DeepSpeed provides a convenience class :class:`deepspeed.utils.RepeatingLoader`
that wraps data loaders to automatically restart upon a ``StopIteration``.
Args:
data_iter (Iterator, optional): Iterator of training data.
Returns:
The arithmetic mean of the losses computed this batch.
"""
if DS_PIPE_VERBOSE:
print(f'[{self.global_rank}] start train_batch()', flush=True)
if not torch._C.is_grad_enabled():
raise RuntimeError(
f'train_batch() requires gradients enabled. Use eval_batch() instead.')
if data_iter is not None:
self.set_dataiterator(data_iter)
self.module.train()
self.train()
self.total_loss = None
# Do the work
self.timers('train_batch').start()
# We only enable prefetching starting from the second batch
if not ENABLE_PYTORCH_BROADCAST:
sched = schedule.BytePSTrainSchedule(micro_batches=self.micro_batches,
stages=self.num_stages,
stage_id=self.stage_id, prefetch=not self.first_train)
else:
sched = schedule.TrainSchedule(micro_batches=self.micro_batches,
stages=self.num_stages,
stage_id=self.stage_id)
cmd = ','.join(str(x) for x in sched)
# log_dist(f'stage_id: {self.stage_id}, sched:{cmd}', ranks=[-1], level=logging.INFO)
self._exec_schedule(sched)
self.agg_train_loss = self._aggregate_total_loss()
self.timers('train_batch').stop()
if self.global_steps % self.steps_per_print() == 0:
if self.global_rank == 0:
elapsed = self.timers('train_batch').elapsed(reset=True)
iter_time = elapsed / self.steps_per_print()
tput = self.train_batch_size() / iter_time
print(f'steps: {self.global_steps} '
f'loss: {self.agg_train_loss:0.4f} '
f'iter time (s): {iter_time:0.3f} '
f'samples/sec: {tput:0.3f}')
# Tensorboard
if self.tensorboard_enabled():
if self.global_rank == 0:
self.summary_events = [(f'Train/Samples/train_loss',
self.agg_train_loss.mean().item(),
self.global_samples)]
for event in self.summary_events: # write_summary_events
self.summary_writer.add_scalar(event[0], event[1], event[2])
if self.global_steps % self.steps_per_print() == 0:
self.summary_writer.flush()
if self.wall_clock_breakdown(
) and self.global_steps % self.steps_per_print() == 0:
self.timers.log([
'pipe_send_output',
'pipe_send_grad',
'pipe_recv_input',
'pipe_recv_grad'
])
# TODO: should return precisely what loss returned and allow others to be queried?
self.first_train = False
if DS_PIPE_VERBOSE:
print(f'[{self.global_rank}] DONE train_batch()', flush=True)
self.result_dict['loss'] = self.agg_train_loss
return self.result_dict
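# A minimal usage sketch (hypothetical names: `engine` is the pipeline engine returned by
# initialization, `train_loader` is a torch DataLoader, `num_steps` is chosen by the caller):
#
#   from deepspeed.utils import RepeatingLoader
#   loader = RepeatingLoader(train_loader)   # restarts automatically on StopIteration
#   data_iter = iter(loader)
#   for step in range(num_steps):
#       results = engine.train_batch(data_iter=data_iter)
#       print(results['loss'])
#
# Each call consumes gradient_accumulation_steps() micro-batches per pipeline and
# returns the dict stored in self.result_dict.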
def eval_batch(self, data_iter):
"""Evaluate the pipeline on a batch of data from ``data_iter``. The
engine will evaluate ``self.train_batch_size()`` total samples
collectively across all workers.
This method is equivalent to:
.. code-block:: python
module.eval()
with torch.no_grad():
output = module(batch)
.. warning::
A total of ``self.gradient_accumulation_steps()`` entries will be pulled
from ``data_iter`` by each pipeline. There must be sufficient
data left in ``data_iter`` or else a ``StopIteration`` will halt training.
DeepSpeed provides a convenience class :class:`deepspeed.utils.RepeatingLoader`
that wraps data loaders to automatically restart upon a ``StopIteration``.
Args:
data_iter (Iterator): Iterator of data to evaluate.
Returns:
A dict of results whose ``'loss'`` entry is the arithmetic mean of the losses computed this batch.
"""
self.module.eval()
self.eval()
self.total_loss = None
# Use the provided data iterator
train_iterator = self.data_iterator
self.set_dataiterator(data_iter)
# Do the work
self.timers('eval_batch').start()
if not ENABLE_PYTORCH_BROADCAST:
sched = schedule.BytePSInferenceSchedule(micro_batches=1,
stages=self.num_stages,
stage_id=self.stage_id, prefetch=False)
else:
sched = schedule.InferenceSchedule(micro_batches=self.micro_batches,
stages=self.num_stages,
stage_id=self.stage_id)
with torch.no_grad():
self._exec_schedule(sched)
self.agg_eval_loss = self._aggregate_total_loss()
self.timers('eval_batch').stop()
# # XXX hack model attribute
# if hasattr(self.module, '_get_metrics'):
# self.module._ref_model[0].metric = {'pscc': self._aggregate_metric()}
# if self.global_rank == 0:
# elapsed = self.timers('eval_batch').elapsed(reset=True)
# iter_time = elapsed
# print(f'loss: {self.agg_eval_loss:0.4f} '
# f'iter time (s): {iter_time:0.3f} ')
if self.tensorboard_enabled():
if self.global_rank == 0:
self.summary_events = [(f'Train/Samples/eval_loss',
self.agg_eval_loss.mean().item(),
self.global_samples)]
for event in self.summary_events: # write_summary_events
self.summary_writer.add_scalar(event[0], event[1], event[2])
self.summary_writer.flush()
# Restore the training iterator
self.set_dataiterator(train_iterator)
# Reset any buffers that may have been populated during the forward passes.
#ds_checkpointing.reset()
self.first_eval = False
self.result_dict['loss'] = self.agg_eval_loss
return self.result_dict
def is_first_stage(self):
"""True if this process is in the first stage in the pipeline."""
return self.stage_id == 0
def is_last_stage(self):
"""True if this process is in the last stage in the pipeline."""
return self.stage_id == self.num_stages - 1
def _aggregate_metric(self):
# Scale loss, average among DP ranks, and bcast loss to the rest of my DP group
if self.is_last_stage():
if DS_PIPE_VERBOSE:
print(f'[{self.global_rank}] bcast src={self.global_rank} group={self.grid.pp_group}', flush=True)
if self.is_data_parallel:
assert False
assert self.global_rank in self.grid.pp_group
metric = torch.Tensor([self.metric]).to(self.device)
dist.broadcast(tensor=metric,
src=self.global_rank,
group=self.mpu.get_pipe_parallel_group())
else:
# Get loss from last stage
src_rank = self.grid.stage_to_global(self.num_stages - 1)
if DS_PIPE_VERBOSE:
print(f'[{self.global_rank}] bcast src={src_rank} group={self.grid.pp_group}', flush=True)
assert src_rank in self.grid.pp_group
metric = torch.Tensor([0.]).to(self.device)
dist.broadcast(tensor=metric,
src=src_rank,
group=self.grid.get_pipe_parallel_group())
self.metric = metric.clone().detach().cpu().numpy()
return self.metric
def _aggregate_total_loss(self):
# Scale loss, average among DP ranks, and bcast loss to the rest of my DP group
if self.is_last_stage():
# XXX Hack: do not scale loss
loss = self._scale_loss(self.total_loss)
self.dp_group_loss = loss.clone().detach()
## Average loss across all data-parallel groups
agg_loss = self.dp_group_loss.clone().detach()
if DS_PIPE_VERBOSE:
print(f'[{self.global_rank}] bcast SENDER src={self.global_rank} group={self.grid.pp_group}', flush=True)
if self.is_data_parallel:
dist.all_reduce(agg_loss, group=self.mpu.get_data_parallel_group())
agg_loss /= self.dp_world_size
assert self.global_rank in self.grid.pp_group
losses = torch.Tensor([self.dp_group_loss, agg_loss]).to(self.device)
dist.broadcast(tensor=losses,
src=self.global_rank,
group=self.mpu.get_pipe_parallel_group())
else:
# Get loss from last stage
src_rank = self.grid.stage_to_global(self.num_stages - 1)
assert src_rank in self.grid.pp_group
losses = torch.Tensor([0., 0.]).to(self.device)
if DS_PIPE_VERBOSE:
print(f'[{self.global_rank}] bcast RECVER src={src_rank} group={self.grid.pp_group}', flush=True)
dist.broadcast(tensor=losses,
src=src_rank,
group=self.grid.get_pipe_parallel_group())
self.dp_group_loss = losses[0].clone().detach()
agg_loss = losses[1].clone().detach()
if DS_PIPE_VERBOSE:
print(f'DONE aggregate total loss', flush=True)
return agg_loss
def set_dataloader(self, loader):
""""""
if self.is_first_stage() or self.is_last_stage():
self.training_dataloader = loader
self.data_iterator = iter(self.training_dataloader)
def set_dataiterator(self, iterator):
""" Store an iterator to sample for training data. """
if self.is_first_stage() or self.is_last_stage():
self.training_dataloader = None
self.data_iterator = iterator
def set_batch_fn(self, fn):
self.batch_fn = fn
# sig = signature(fn)
# params = sig.parameters
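# Hedged sketch of a batch_fn: _next_batch() invokes it as batch_fn(batch, self.train_mode),
# so it should accept the raw batch plus a train/eval flag and return the
# (inputs, labels) pair the pipeline expects. The names below are illustrative only:
#
#   def my_batch_fn(batch, train_mode):
#       tokens, labels = batch
#       return (tokens, labels)
#
#   engine.set_batch_fn(my_batch_fn)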
def is_gradient_accumulation_boundary(self):
"""True if the engine is executing a gradient reduction or optimizer step instruction.
This is overridden from :class:`DeepSpeedEngine` to force reductions
and steps when the pipeline engine is instructed to do so.
Returns:
bool: whether reductions and optimizer steps should occur.
"""
return self._force_grad_boundary
def tput_log(self, *msg):
if self.global_rank == 0 and self.global_steps % self.steps_per_print() == 0:
print(*msg)
def _next_batch(self):
if self.is_model_parallel:
mp_rank = self.grid.get_slice_parallel_rank()
else:
mp_rank = 0
batch = None
# Only MP rank 0 loads the data.
if mp_rank == 0:
if self.data_iterator is None:
raise ValueError(f"RANK={self.global_rank} no data iterator provided.")
batch = next(self.data_iterator)
# All MP ranks participate in batch_fn, where they might broadcast the data.
if self.batch_fn:
batch = self.batch_fn(batch, self.train_mode)
# Sanity check dimensions.
# XXX: the last minibatch with size < micro_batch_size kills us
if torch.is_tensor(batch[0]):
if batch[0].size(0) != self.micro_batch_size:
print(f'size mismatch: {batch[0].size(0)} mb: {self.micro_batch_size}')
assert batch[0].size(0) == self.micro_batch_size
return self._next_batch()
else:
assert torch.is_tensor(batch[0][0])
if batch[0][0].size(0) != self.micro_batch_size:
print(f'HB next_batch: {batch[0][0].shape} vs {self.micro_batch_size}', flush=True)
return self._next_batch()
return batch
def _exec_bps_forward_pass(self, buffer_id):
self.tput_timer.start()
self.mem_status('BEFORE FWD', reset_max=True)
self._profiling_func_enter('_exec_bps_forward_pass')
if isinstance(self.pipe_buffers['inputs'][buffer_id], tuple):
inputs = tuple(t.clone() for t in self.pipe_buffers['inputs'][buffer_id])
else:
inputs = self.pipe_buffers['inputs'][buffer_id].clone()
# collect the partitioned input from the previous stage
assert not self.is_pipe_partitioned
# Zero out the gradients each time we use the tensor because only the data in
# tensor changes across batches
self._zero_grads(inputs)
outputs = super(PipelineEngine, self).forward(inputs)
# Partition the outputs if we are not the last stage
assert not self.is_pipe_partitioned
self.pipe_buffers['outputs'][buffer_id] = outputs
# Optionally compute loss and metrics on the last device
if self.is_last_stage():
if self.loss_model is not None:
labels = self.pipe_buffers['labels'][buffer_id]
ret = self.loss_model(outputs, labels)
if isinstance(ret, dict):
self.result_dict = ret
self.loss = self.result_dict['loss']
else:
self.loss = ret
else:
# Some models just return loss from forward()
self.loss = outputs
# get metric from self.module
if isinstance(self.loss, torch.Tensor):
if self.total_loss is None:
self.total_loss = torch.zeros_like(self.loss)
self.total_loss += self.loss.detach()
else:
if self.total_loss is None:
self.total_loss = [torch.zeros_like(l) for l in self.loss]
for idx, l in enumerate(self.loss):
self.total_loss[idx] += l.detach()
self._profiling_func_exit()
def _exec_bps_backward_pass(self, buffer_id):
self._profiling_func_enter('_exec_bps_backward_pass')
assert self.optimizer is not None, "must provide optimizer during " \
"init in order to use backward"
self.mem_status('BEFORE BWD', reset_max=True)
# The last stage just runs backward on the loss using DeepSpeed's typical
# mechanisms.
if self.is_last_stage():
super(PipelineEngine, self).backward(self.loss)
self.mem_status('AFTER BWD')
self._profiling_func_exit()
return
outputs = self.pipe_buffers['outputs'][buffer_id]
if self.wall_clock_breakdown():
self.timers('backward_microstep').start()
self.timers('backward').start()
self.timers('backward_inner_microstep').start()
self.timers('backward_inner').start()
assert not self.is_pipe_partitioned
assert not self.is_grad_partitioned
# TODO: do we need to clone()?
grad_tensors = self.pipe_buffers['bps_grad_recv'][buffer_id]
if isinstance(outputs, tuple):
out_tensors = [t for t in outputs if t.is_floating_point()]
assert len(out_tensors) == len(grad_tensors)
new_out_tensors=[]
new_grad_tensors=[]
for t,g in zip(out_tensors, grad_tensors):
if t.requires_grad:
new_out_tensors.append(t)
new_grad_tensors.append(g)
assert len(new_out_tensors) == len(new_grad_tensors)
torch.autograd.backward(tensors=new_out_tensors, grad_tensors=new_grad_tensors)
else:
torch.autograd.backward(tensors=(outputs,), grad_tensors=(grad_tensors,))
# Free up the memory from the output of forward()
self.pipe_buffers['output_tensors'][buffer_id] = None
self.pipe_buffers['outputs'][buffer_id] = None
grad_tensors = None
if self.wall_clock_breakdown():
self.timers('backward_inner').stop()
self.timers('backward_inner_microstep').stop()
self.timers('backward').stop()
self.timers('backward_microstep').stop()
self.mem_status('AFTER BWD')
self._profiling_func_exit()
def _exec_load_micro_batch(self, buffer_id):
self._profiling_func_enter('_exec_load_micro_batch')
if self.wall_clock_breakdown():
self.timers('batch_input').start()
batch = self._next_batch()
if self.is_first_stage():
loaded = None
if torch.is_tensor(batch[0]):
loaded = batch[0].clone().to(self.device).detach()
loaded.requires_grad = loaded.is_floating_point()
if MEGATRON_DEBUG_DATA:
print(f'batch = {loaded.sum().detach()}', flush=True)
else:
assert isinstance(batch[0], tuple)
# Assume list or tuple
loaded = []
for x in batch[0]:
assert torch.is_tensor(x)
mine = x.clone().detach().to(self.device)
mine.requires_grad = mine.is_floating_point()
loaded.append(mine)
loaded = tuple(loaded)
if MEGATRON_DEBUG_DATA:
print(f'rank: {self.global_rank}, stage: {self.stage_id}, batch[0] = {[x.sum().detach() for x in loaded]}', flush=True)
self.pipe_buffers['inputs'][buffer_id] = loaded
if self.is_last_stage():
loaded = batch[1]
if torch.is_tensor(batch[1]):
loaded = batch[1].to(self.device)
if MEGATRON_DEBUG_DATA:
print(f'rank: {self.global_rank}, stage: {self.stage_id}, batch[1] = {[x.sum().detach() for x in loaded]}', flush=True)
elif isinstance(batch[1], tuple):
loaded = []
for x in batch[1]:
assert torch.is_tensor(x)
x = x.to(self.device).detach()
loaded.append(x)
loaded = tuple(loaded)
if MEGATRON_DEBUG_DATA:
print(f'rank: {self.global_rank}, stage: {self.stage_id}, batch[1] = {[x.sum().detach() for x in loaded]}', flush=True)
self.pipe_buffers['labels'][buffer_id] = loaded
if self.wall_clock_breakdown():
self.timers('batch_input').stop()
self._profiling_func_exit()
def _send_tensor_meta(self, buffer, recv_stage):
"""Communicate metadata about upcoming p2p transfers.
Metadata is communicated in this order:
* type (0: tensor, 1: list, 2: tuple)
* num_tensors if type is list or tuple
foreach tensor in buffer:
* ndims
* shape
* dtype
"""
self._profiling_func_enter('_send_tensor_meta')
send_bytes = 0
if isinstance(buffer, torch.Tensor):
type_tensor = torch.LongTensor(data=[0]).to(self.device)
p2p.send(type_tensor, recv_stage)
send_shape = torch.LongTensor(data=buffer.size()).to(self.device)
send_ndims = torch.LongTensor(data=[len(buffer.size())]).to(self.device)
send_dtype = torch.LongTensor(data=[_dtype_to_code(buffer.dtype)]).to(self.device)
p2p.send(send_ndims, recv_stage)
p2p.send(send_shape, recv_stage)
p2p.send(send_dtype, recv_stage)
send_bytes += _tensor_bytes(buffer)
elif isinstance(buffer, list):
assert (False)
type_tensor = torch.LongTensor(data=[1]).to(self.device)
p2p.send(type_tensor, recv_stage)
count_tensor = torch.LongTensor(data=[len(buffer)]).to(self.device)
p2p.send(count_tensor, recv_stage)
for tensor in buffer:
assert isinstance(tensor, torch.Tensor)
send_shape = torch.LongTensor(data=tensor.size()).to(self.device)
send_ndims = torch.LongTensor(data=[len(tensor.size())]).to(self.device)
send_dtype = torch.LongTensor(data=[_dtype_to_code(tensor.dtype)]).to(self.device)
p2p.send(send_ndims, recv_stage)
p2p.send(send_shape, recv_stage)
p2p.send(send_dtype, recv_stage)
send_bytes += _tensor_bytes(tensor)
elif isinstance(buffer, tuple):
type_tensor = torch.LongTensor(data=[2]).to(self.device)
p2p.send(type_tensor, recv_stage)
count_tensor = torch.LongTensor(data=[len(buffer)]).to(self.device)
p2p.send(count_tensor, recv_stage)
for idx, tensor in enumerate(buffer):
assert isinstance(tensor, torch.Tensor)
send_shape = torch.LongTensor(data=tensor.size()).to(self.device)
send_ndims = torch.LongTensor(data=[len(tensor.size())]).to(self.device)
send_dtype = torch.LongTensor(data=[_dtype_to_code(tensor.dtype)]).to(self.device)
p2p.send(send_ndims, recv_stage)
p2p.send(send_shape, recv_stage)
p2p.send(send_dtype, recv_stage)
# Useful for performance debugging.
'''
new_bytes = _tensor_bytes(tensor)
send_bytes += _tensor_bytes(tensor)
# Useful for performance debugging.
if self.grid.data_parallel_id == 0:
print(
f'STAGE={self.stage_id} pipe-send-volume[{idx}]: shape={send_shape} {new_bytes/1024**2:0.2f}MB'
)
'''
else:
raise NotImplementedError(f'Could not send meta type {type(buffer)}')
self._profiling_func_exit()
# Useful for performance debugging.
'''
if self.grid.data_parallel_id == 0:
print(f'STAGE={self.stage_id} pipe-send-volume: {send_bytes/1024**2:0.2f}MB')
'''
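# Illustration of the metadata exchange above for a tuple of two tensors with
# shapes (4, 1024) and (4,): the sender transmits, in order,
#   type=[2], count=[2],
#   ndims=[2], shape=[4, 1024], dtype=[code(dtype0)],
#   ndims=[1], shape=[4],       dtype=[code(dtype1)],
# and _recv_tensor_meta() consumes the same sequence to allocate matching buffers.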
def _recv_tensor_meta(self, send_stage):
"""Receive metadata about upcoming p2p transfers and return allocated buffers.
Metadata is communicated in this order:
* type (0: tensor, 1: list, 2: tuple)
* num_tensors if type is list or tuple
foreach tensor in buffer:
* ndims
* shape
* dtype
Returns:
Allocated buffer for receiving from send_stage.
"""
self._profiling_func_enter('_recv_tensor_meta')
type_tensor = torch.LongTensor(data=[0]).to(self.device)
p2p.recv(type_tensor, send_stage)
recv_type = type_tensor.item()
# A single tensor will be sent.
if recv_type == 0:
recv_ndims = torch.LongTensor(data=[0]).to(self.device)
p2p.recv(recv_ndims, send_stage)
recv_ndims = recv_ndims.item()
recv_shape = torch.LongTensor([1] * recv_ndims).to(self.device)
p2p.recv(recv_shape, send_stage)
recv_shape = recv_shape.tolist()
recv_dtype = torch.LongTensor(data=[0]).to(self.device)
p2p.recv(recv_dtype, send_stage)
recv_dtype_code = recv_dtype.item()
recv_dtype = _code_to_dtype(recv_dtype_code)
return self._allocate_buffer2(recv_shape, recv_dtype, num_buffers=1)[0]
# List or tuple of tensors
elif recv_type == 1 or recv_type == 2:
count_tensor = torch.LongTensor(data=[0]).to(self.device)
p2p.recv(count_tensor, send_stage)
num_tensors = count_tensor.item()
recv_shapes = []
recv_dtypes = []
for idx in range(num_tensors):
recv_ndims = torch.LongTensor(data=[0]).to(self.device)
p2p.recv(recv_ndims, send_stage)
recv_ndims = recv_ndims.item()
recv_shape = torch.LongTensor([1] * recv_ndims).to(self.device)
p2p.recv(recv_shape, send_stage)
recv_shapes.append(recv_shape.tolist())
recv_dtype = torch.LongTensor(data=[0]).to(self.device)
p2p.recv(recv_dtype, send_stage)
recv_dtype_code = recv_dtype.item()
recv_dtype = _code_to_dtype(recv_dtype_code)
recv_dtypes.append(recv_dtype)
buffers = self._allocate_buffers2(recv_shapes, recv_dtypes, num_buffers=1)[0]
# Convert to tuples if requested.
if recv_type == 2:
buffers = tuple(buffers)
return buffers
else:
raise NotImplementedError(f'Could not receive type {recv_type}')
self._profiling_func_exit()
def _mp_slice(self, x):
mp_size = self.grid.get_model_parallel_world_size()
return x.reshape((mp_size, -1))[self.mp_id:self.mp_id+1, :].detach()
def _mp_view(self, x, rank):
mp_size = self.grid.get_model_parallel_world_size()
return x.view((mp_size, -1))[rank:rank+1, :]
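# Example of the slice/view helpers above: with model-parallel size 4 and a tensor
# holding 8 contiguous elements, x.reshape((4, -1)) has shape (4, 2), so
# _mp_slice(x) on mp_id=1 returns the (1, 2) slice covering elements [2:4], and
# _mp_view(x, rank) exposes the matching slice for another rank, which is what the
# all_gather/broadcast of partitioned activations and gradients operates on.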
def _exec_bps_send_partitioned_activations(self, buffer_id):
self._profiling_func_enter('_exec_bps_send_activations')
if self.wall_clock_breakdown():
self.timers('pipe_send_output').start()
outputs = self.pipe_buffers['outputs'][buffer_id]
if self.first_output_send:
self.first_output_send = False
self._send_tensor_meta(outputs, self.next_stage)
assert not self.args.broadcast_activation
assert ENABLE_BPS_PARTITION
name = f'act_{buffer_id}'
if isinstance(outputs, torch.Tensor):
p2p.bps_send(self._mp_slice(outputs.contiguous()),
self.next_stage, name, index=0, async_op=True)
elif isinstance(outputs, (tuple, list)):
for idx, buffer in enumerate(outputs):
if DS_PIPE_VERBOSE >= 3:
print(f'DS BPS_SEND tensors {idx}/{len(outputs)}', flush=True)
p2p.bps_send(self._mp_slice(buffer.contiguous()), self.next_stage,
name, index=idx, async_op=True)
else:
raise NotImplementedError('Could not send output of type '
f'{type(outputs)}')
if self.wall_clock_breakdown():
self.timers('pipe_send_output').stop()
self._profiling_func_exit()
def _exec_bps_send_activations(self, buffer_id):
self._profiling_func_enter('_exec_bps_send_activations')
if self.wall_clock_breakdown():
self.timers('pipe_send_output').start()
outputs = self.pipe_buffers['outputs'][buffer_id]
if self.first_output_send:
self.first_output_send = False
self._send_tensor_meta(outputs, self.next_stage)
assert not self.args.broadcast_activation
assert not ENABLE_BPS_PARTITION
if self.mp_id == 0:
name = f'act_{buffer_id}'
if isinstance(outputs, torch.Tensor):
p2p.bps_send(outputs.contiguous(), self.next_stage, name, index=0, async_op=True)
elif isinstance(outputs, (tuple, list)):
for idx, buffer in enumerate(outputs):
if DS_PIPE_VERBOSE >= 3:
print(f'DS BPS_SEND tensors {idx}/{len(outputs)} start', flush=True)
p2p.bps_send(buffer.contiguous(), self.next_stage, name, index=idx, async_op=True)
if DS_PIPE_VERBOSE >= 3:
print(f'DS BPS_SEND tensors {idx}/{len(outputs)} end', flush=True)
else:
raise NotImplementedError('Could not send output of type '
f'{type(outputs)}')
if self.wall_clock_breakdown():
self.timers('pipe_send_output').stop()
self._profiling_func_exit()
def _exec_bps_send_grads(self, buffer_id):
self._profiling_func_enter('_exec_bps_send_grads')
if self.wall_clock_breakdown():
self.timers('pipe_send_grad').start()
inputs = self.pipe_buffers['inputs'][buffer_id]
# Partition the gradient
assert not self.is_grad_partitioned
assert not self.args.broadcast_grads
name = f'grad_{buffer_id}'
# only MP rank 0 sends the gradient
if self.grid.get_model_parallel_rank() == 0:
if isinstance(inputs, torch.Tensor):
if inputs.grad is None:
send_data = self._allocate_zeros(inputs.size())
else:
send_data = inputs.grad
assert send_data.is_floating_point()
assert send_data is not None
p2p.bps_send(send_data, self.prev_stage, name, index=0, async_op=True)
else:
for idx, buffer in enumerate(inputs):
if not buffer.is_floating_point():
continue
if buffer.grad is None:
send_data = self._allocate_zeros(buffer.size())
else:
send_data = buffer.grad
assert send_data.is_floating_point()
assert send_data is not None
p2p.bps_send(send_data, self.prev_stage, name, index=idx, async_op=True)
# We can free up the input buffer now
self.pipe_buffers['inputs'][buffer_id] = None
if self.wall_clock_breakdown():
self.timers('pipe_send_grad').stop()
self._profiling_func_exit()
def _exec_bps_send_partitioned_grads(self, buffer_id):
self._profiling_func_enter('_exec_bps_send_grads')
if self.wall_clock_breakdown():
self.timers('pipe_send_grad').start()
inputs = self.pipe_buffers['inputs'][buffer_id]
# Partition the gradient
assert not self.is_grad_partitioned
assert not self.args.broadcast_grads
assert ENABLE_BPS_PARTITION
name = f'grad_{buffer_id}'
if isinstance(inputs, torch.Tensor):
if inputs.grad is None:
send_data = self._allocate_zeros(inputs.size())
else:
send_data = inputs.grad
assert send_data.is_floating_point()
assert send_data is not None
p2p.bps_send(self._mp_slice(send_data), self.prev_stage, name,
index=0, async_op=True)
else:
for idx, buffer in enumerate(inputs):
if not buffer.is_floating_point():
continue
if buffer.grad is None:
send_data = self._allocate_zeros(buffer.size())
else:
send_data = buffer.grad
assert send_data.is_floating_point()
assert send_data is not None
p2p.bps_send(self._mp_slice(send_data), self.prev_stage,
name, index=idx, async_op=True)
# We can free up the input buffer now
self.pipe_buffers['inputs'][buffer_id] = None
if self.wall_clock_breakdown():
self.timers('pipe_send_grad').stop()
self._profiling_func_exit()
def _exec_bps_sync_all(self):
p2p.bps_sync_all()
def _exec_bps_sync_partitioned_grads(self, buffer_id):
name = f'grad_{buffer_id}'
recv_buff = self.pipe_buffers['bps_grad_recv'][buffer_id]
if isinstance(recv_buff, torch.Tensor):
p2p.bps_sync(self.next_stage, name, index=0)
else:
for i in range(len(recv_buff)):
p2p.bps_sync(self.next_stage, name, index=i)
# all_gather the gradient from other ranks
mp_size = self.grid.model_parallel_size
if mp_size > 1:
src_rank = self.grid.slice_parallel_src_id
group = self.grid.slice_proc_group
if isinstance(recv_buff, torch.Tensor):
recv_buff_views = [self._mp_view(recv_buff, i) for i in range(mp_size)]
dist.all_gather(recv_buff_views, recv_buff_views[self.mp_id].clone(),
group=group, async_op=False)
else:
for i in range(len(recv_buff)):
if recv_buff[i].is_floating_point():
recv_buff_views = [self._mp_view(recv_buff[i], j) for j in range(mp_size)]
dist.all_gather(recv_buff_views, recv_buff_views[self.mp_id].clone(),
group=group, async_op=False)
def _exec_bps_sync_grads(self, buffer_id):
name = f'grad_{buffer_id}'
recv_buff = self.pipe_buffers['bps_grad_recv'][buffer_id]
if self.mp_id == 0:
if isinstance(recv_buff, torch.Tensor):
p2p.bps_sync(self.next_stage, name, index=0)
else:
for i in range(len(recv_buff)):
p2p.bps_sync(self.next_stage, name, index=i)
# broadcast the activation at MP rank 0 to other ranks
if self.grid.model_parallel_size > 1:
src_rank = self.grid.slice_parallel_src_id
group = self.grid.slice_proc_group
if isinstance(recv_buff, torch.Tensor):
dist.broadcast(recv_buff, src_rank, group=group, async_op=False)
else:
for i in range(len(recv_buff)):
if recv_buff[i].is_floating_point():
dist.broadcast(recv_buff[i], src_rank, group=group, async_op=False)
def _exec_bps_sync_partitioned_activations(self, buffer_id):
recv_buff = self.pipe_buffers['bps_act_recv'][buffer_id]
recvd = None
src_rank = self.grid.slice_parallel_src_id
mp_size = self.grid.model_parallel_size
group = self.grid.slice_proc_group
name = f'act_{buffer_id}'
if isinstance(recv_buff, torch.Tensor):
p2p.bps_sync(self.prev_stage, name, index=0)
# broadcast the activation at MP rank 0 to other ranks
if mp_size > 1:
recv_buff_views = [self._mp_view(recv_buff, i) for i in range(mp_size)]
dist.all_gather(recv_buff_views, recv_buff_views[self.mp_id].clone(),
group=group, async_op=False)
recvd = recv_buff.clone().detach()
recvd.requires_grad = recv_buff.is_floating_point()
else:
recvd = [None] * len(recv_buff)
for i in range(len(recv_buff)):
p2p.bps_sync(self.prev_stage, name, index=i)
# broadcast the activation at MP rank 0 to other ranks
if mp_size > 1:
recv_buff_views = [self._mp_view(recv_buff[i], j) for j in range(mp_size)]
dist.all_gather(recv_buff_views, recv_buff_views[self.mp_id].clone(),
group=group, async_op=False)
recvd[i] = recv_buff[i].clone().detach()
recvd = tuple(recvd)
for buffer in recvd:
buffer.requires_grad = buffer.is_floating_point()
self.pipe_buffers['inputs'][buffer_id] = recvd
def _exec_bps_sync_activations(self, buffer_id):
recv_buff = self.pipe_buffers['bps_act_recv'][buffer_id]
recvd = None
src_rank = self.grid.slice_parallel_src_id
group = self.grid.slice_proc_group
name = f'act_{buffer_id}'
if isinstance(recv_buff, torch.Tensor):
if self.mp_id == 0:
p2p.bps_sync(self.prev_stage, name, index=0)
# broadcast the activation at MP rank 0 to other ranks
if self.grid.model_parallel_size > 1:
dist.broadcast(recv_buff, src_rank, group=group, async_op=False)
recvd = recv_buff.clone().detach()
recvd.requires_grad = recv_buff.is_floating_point()
else:
recvd = [None] * len(recv_buff)
for i in range(len(recv_buff)):
if self.mp_id == 0:
p2p.bps_sync(self.prev_stage, name, index=i)
# broadcast the activation at MP rank 0 to other ranks
if self.grid.model_parallel_size > 1:
dist.broadcast(recv_buff[i], src_rank, group=group, async_op=False)
recvd[i] = recv_buff[i].clone().detach()
recvd = tuple(recvd)
for buffer in recvd:
buffer.requires_grad = buffer.is_floating_point()
self.pipe_buffers['inputs'][buffer_id] = recvd
def _exec_bps_recv_partitioned_activations(self, buffer_id):
self._profiling_func_enter('_exec_bps_recv_activations')
if self.wall_clock_breakdown():
self.timers('pipe_recv_input').start()
recv_buffs = self.pipe_buffers['bps_act_recv']
# Allocate the buffer if necessary
if recv_buffs[buffer_id] is None:
if recv_buffs[0] is None:
recv_buffs[buffer_id] = self._recv_tensor_meta(self.prev_stage)
else:
if torch.is_tensor(recv_buffs[0]):
recv_buffs[buffer_id] = recv_buffs[0].clone().detach()
else:
recv_buffs[buffer_id] = tuple([x.clone().detach() for x in recv_buffs[0]])
assert not self.args.broadcast_activation
assert not self.is_pipe_partitioned
recv_buff = recv_buffs[buffer_id]
name = f'act_{buffer_id}'
if isinstance(recv_buff, torch.Tensor):
p2p.bps_recv(self._mp_view(recv_buff, self.mp_id), self.prev_stage,
name, index=0, async_op=True)
else:
assert isinstance(recv_buff, (tuple, list))
for idx, buffer in enumerate(recv_buff):
assert torch.is_tensor(buffer)
p2p.bps_recv(self._mp_view(buffer, self.mp_id), self.prev_stage,
name, index=idx, async_op=True)
if self.wall_clock_breakdown():
self.timers('pipe_recv_input').stop()
self._profiling_func_exit()
def _exec_bps_recv_activations(self, buffer_id):
self._profiling_func_enter('_exec_bps_recv_activations')
if self.wall_clock_breakdown():
self.timers('pipe_recv_input').start()
recv_buffs = self.pipe_buffers['bps_act_recv']
# Allocate the buffer if necessary
if recv_buffs[buffer_id] is None:
if recv_buffs[0] is None:
recv_buffs[buffer_id] = self._recv_tensor_meta(self.prev_stage)
else:
if torch.is_tensor(recv_buffs[0]):
recv_buffs[buffer_id] = recv_buffs[0].clone().detach()
else:
recv_buffs[buffer_id] = tuple([x.clone().detach() for x in recv_buffs[0]])
assert not self.args.broadcast_activation
assert not self.is_pipe_partitioned
recv_buff = recv_buffs[buffer_id]
if self.mp_id == 0:
name = f'act_{buffer_id}'
if isinstance(recv_buff, torch.Tensor):
p2p.bps_recv(recv_buff, self.prev_stage, name, index=0, async_op=True)
else:
assert isinstance(recv_buff, (tuple, list))
for idx, buffer in enumerate(recv_buff):
assert torch.is_tensor(buffer)
p2p.bps_recv(buffer, self.prev_stage, name, index=idx, async_op=True)
if self.wall_clock_breakdown():
self.timers('pipe_recv_input').stop()
self._profiling_func_exit()
def _exec_bps_recv_partitioned_grads(self, buffer_id):
self._profiling_func_enter('_exec_bps_recv_grads')
if self.wall_clock_breakdown():
self.timers('pipe_recv_grad').start()
outputs = self.pipe_buffers['outputs'][buffer_id]
grad_buffs = self.pipe_buffers['bps_grad_recv']
# Restore partitioned output if it was partitioned and we are sending full gradients
assert not self.is_pipe_partitioned
assert not self.is_grad_partitioned
assert not self.args.broadcast_grads
assert ENABLE_BPS_PARTITION
# Allocate gradient if necessary
if grad_buffs[buffer_id] is None:
if isinstance(outputs, torch.Tensor):
s = list(outputs.size())
grad_buffs[buffer_id] = self._allocate_buffer(s, num_buffers=1)[0]
else:
sizes = [list(t.size()) for t in outputs if t.is_floating_point()]
grad_buffs[buffer_id] = self._allocate_buffers(sizes, num_buffers=1)[0]
grad_buff = grad_buffs[buffer_id]
name = f'grad_{buffer_id}'
if isinstance(grad_buff, torch.Tensor):
p2p.bps_recv(self._mp_view(grad_buff, self.mp_id), self.next_stage,
name, index=0, async_op=True)
else:
assert isinstance(outputs, tuple)
recv_idx = 0
for idx, buffer in enumerate(grad_buff):
p2p.bps_recv(self._mp_view(buffer, self.mp_id), self.next_stage,
name, index=recv_idx, async_op=True)
recv_idx += 1
if self.wall_clock_breakdown():
self.timers('pipe_recv_grad').stop()
self._profiling_func_exit()
def _exec_bps_recv_grads(self, buffer_id):
self._profiling_func_enter('_exec_bps_recv_grads')
if self.wall_clock_breakdown():
self.timers('pipe_recv_grad').start()
outputs = self.pipe_buffers['outputs'][buffer_id]
grad_buffs = self.pipe_buffers['bps_grad_recv']
# Restore partitioned output if it was partitioned and we are sending full gradients
assert not self.is_pipe_partitioned
assert not self.is_grad_partitioned
assert not self.args.broadcast_grads
# Allocate gradient if necessary
if grad_buffs[buffer_id] is None:
if isinstance(outputs, torch.Tensor):
s = list(outputs.size())
grad_buffs[buffer_id] = self._allocate_buffer(s, num_buffers=1)[0]
else:
sizes = [list(t.size()) for t in outputs if t.is_floating_point()]
grad_buffs[buffer_id] = self._allocate_buffers(sizes, num_buffers=1)[0]
grad_buff = grad_buffs[buffer_id]
name = f'grad_{buffer_id}'
if isinstance(grad_buff, torch.Tensor):
if self.mp_id == 0:
p2p.bps_recv(grad_buff, self.next_stage, name, index=0, async_op=True)
else:
assert isinstance(outputs, tuple)
recv_idx = 0
if self.mp_id == 0:
for idx, buffer in enumerate(grad_buff):
p2p.bps_recv(buffer, self.next_stage, name, index=recv_idx, async_op=True)
recv_idx += 1
if self.wall_clock_breakdown():
self.timers('pipe_recv_grad').stop()
self._profiling_func_exit()
def _exec_optimizer_step(self, lr_kwargs=None):
self._profiling_func_enter('_exec_optimizer_step')
if self.wall_clock_breakdown():
self.timers('step_microstep').start()
self.timers('step').start()
self.mem_status('BEFORE STEP', reset_max=True)
if self.global_rank == 0 and MEGATRON_DEBUG_GRAD:
params = list(self.module.named_parameters())
for i in (0, 1, -2, -1):
p = params[i]
if p[1] is None:
print(f'name={p[0]} | None', flush=True)
elif p[1].grad is None:
print(f'name={p[0]} | weight={p[1].mean()}', flush=True)
else:
print(f'name={p[0]} | weight={p[1].norm()} | grad={p[1].grad.norm()}', flush=True)
params_w_grad = []
params_wo_grad = []
for p in params:
if p[1].grad is not None:
params_w_grad.append(p[0])
else:
params_wo_grad.append(p[0])
self._force_grad_boundary = True
self._take_model_step(lr_kwargs)
self._force_grad_boundary = False
self.mem_status('AFTER STEP')
if self.tensorboard_enabled():
if self.global_rank == 0:
self.summary_events = [(f'Train/Samples/lr',
self.get_lr()[0],
self.global_samples)]
if self.fp16_enabled() and hasattr(self.optimizer, 'cur_scale'):
self.summary_events.append((f'Train/Samples/loss_scale',
self.optimizer.cur_scale,
self.global_samples))
for event in self.summary_events: # write_summary_events
self.summary_writer.add_scalar(event[0], event[1], event[2])
if self.wall_clock_breakdown():
self.timers('step_microstep').stop()
self.timers('step').stop()
if self.global_steps % self.steps_per_print() == 0:
self.timers.log([
'batch_input',
'forward_microstep',
'backward_microstep',
'backward_inner_microstep',
'backward_allreduce_microstep',
'backward_tied_allreduce_microstep',
'step_microstep'
])
if self.global_steps % self.steps_per_print() == 0:
self.timers.log([
'forward',
'backward',
'backward_inner',
'backward_allreduce',
'step'
])
self._profiling_func_exit()
def _zero_grads(self, inputs):
if isinstance(inputs, torch.Tensor):
if inputs.grad is not None:
inputs.grad.data.zero_()
else:
for t in inputs:
if t.grad is not None:
t.grad.data.zero_()
def _allocate_zeros(self, shape, fp16=None, **kwargs):
""" Allocate a tensor of zeros on the engine's device.
Arguments:
shape: the shape of the tensor to allocate
fp16 (bool): whether to use FP16. default: defer to self.fp16_enabled()
kwargs: passed to torch.zeros()
Returns:
A tensor from torch.zeros() allocated on self.device.
"""
if fp16 is None:
fp16 = self.fp16_enabled()
if fp16:
return torch.zeros(shape, dtype=torch.half, device=self.device, **kwargs)
else:
return torch.zeros(shape, device=self.device, **kwargs)
def _allocate_zeros2(self, shape, dtype, **kwargs):
return torch.zeros(shape, dtype=dtype, device=self.device, **kwargs)
def _allocate_buffer(self, shape, num_buffers=-1, **kwargs):
buffers = []
if num_buffers == -1:
num_buffers = self.num_pipe_buffers
for count in range(num_buffers):
buffers.append(self._allocate_zeros(shape, **kwargs))
return buffers
def _allocate_buffer2(self, shape, dtype, num_buffers=-1, **kwargs):
buffers = []
if num_buffers == -1:
num_buffers = self.num_pipe_buffers
for count in range(num_buffers):
buffers.append(self._allocate_zeros2(shape, dtype, **kwargs))
return buffers
def _allocate_buffers(self, shapes, requires_grad=False, num_buffers=-1):
buffers = []
if num_buffers == -1:
num_buffers = self.num_pipe_buffers
for count in range(num_buffers):
buffer = []
for shape in shapes:
buffer.append(self._allocate_zeros(shape, requires_grad=requires_grad))
buffers.append(buffer)
return buffers
def _allocate_buffers2(self, shapes, dtypes, requires_grad=False, num_buffers=-1):
buffers = []
if num_buffers == -1:
num_buffers = self.num_pipe_buffers
for count in range(num_buffers):
buffer = []
for i in range(len(shapes)):
buffer.append(self._allocate_zeros2(shapes[i], dtypes[i], requires_grad=requires_grad))
buffers.append(buffer)
return buffers
def forward(self, *args, **kwargs):
"""Disabled for pipeline parallel training. See ``train_batch()``. """
raise PipelineError("Only train_batch() is accessible in pipeline mode.")
def backward(self, *args, **kwargs):
"""Disabled for pipeline parallel training. See ``train_batch()``. """
raise PipelineError("Only train_batch() is accessible in pipeline mode.")
def step(self, *args, **kwargs):
"""Disabled for pipeline parallel training. See ``train_batch()``. """
raise PipelineError("Only train_batch() is accessible in pipeline mode.")
# A map of PipeInstruction types to methods. Each method will be executed with the
# kwargs provided to the PipeInstruction from the scheduler.
_INSTRUCTION_MAP = {
schedule.OptimizerStep: _exec_optimizer_step,
schedule.ReduceGrads: _exec_reduce_grads,
schedule.ReduceTiedGrads: _exec_reduce_tied_grads,
schedule.LoadMicroBatch: _exec_load_micro_batch,
schedule.BytePSForwardPass: _exec_bps_forward_pass,
schedule.BytePSBackwardPass: _exec_bps_backward_pass,
schedule.BytePSSendActivation: _exec_bps_send_partitioned_activations if ENABLE_BPS_PARTITION else _exec_bps_send_activations,
schedule.BytePSRecvActivation: _exec_bps_recv_partitioned_activations if ENABLE_BPS_PARTITION else _exec_bps_recv_activations,
schedule.BytePSSyncActivation: _exec_bps_sync_partitioned_activations if ENABLE_BPS_PARTITION else _exec_bps_sync_activations,
schedule.BytePSSyncGrad: _exec_bps_sync_partitioned_grads if ENABLE_BPS_PARTITION else _exec_bps_sync_grads,
schedule.BytePSSendGrad: _exec_bps_send_partitioned_grads if ENABLE_BPS_PARTITION else _exec_bps_send_grads,
schedule.BytePSRecvGrad: _exec_bps_recv_partitioned_grads if ENABLE_BPS_PARTITION else _exec_bps_recv_grads,
schedule.BytePSSyncAll: _exec_bps_sync_all
}
def _exec_schedule(self, pipe_schedule):
self._reserve_pipe_buffers(pipe_schedule.num_pipe_buffers())
# For each step in the schedule
has_optim_step = False
for step_cmds in pipe_schedule:
# For each instruction in the step
for cmd in step_cmds:
if isinstance(cmd, schedule.OptimizerStep):
has_optim_step = True
if DS_PIPE_VERBOSE:
if "buffer_id" in cmd.kwargs:
print(f'[{self.grid.get_global_rank()}] | cmd={cmd.__class__.__name__} | {cmd.kwargs["buffer_id"]}', flush=True)
else:
print(f'[{self.grid.get_global_rank()}] | cmd={cmd.__class__.__name__}', flush=True)
if type(cmd) not in self._INSTRUCTION_MAP:
raise RuntimeError(
f'{self.__class__.__name__} does not understand instruction {repr(cmd)}'
)
self._exec_instr = MethodType(self._INSTRUCTION_MAP[type(cmd)], self)
self._exec_instr(**cmd.kwargs)
# check for anomalies
if isinstance(pipe_schedule, (schedule.BytePSTrainSchedule, schedule.TrainSchedule)):
assert has_optim_step
================================================
FILE: src/veGiantModel/engine/module.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright 2019 The Microsoft DeepSpeed Team
import os
import re as regex
from functools import partial
import torch
import torch.nn as nn
import torch.distributed as dist
from math import floor
from deepspeed.utils import logger
from deepspeed.runtime import utils as ds_utils
from deepspeed.runtime.activation_checkpointing import checkpointing
from deepspeed.pipe import PipelineModule,LayerSpec, TiedLayerSpec
from .topology import PipeDataParallelTopology, PipelineParallelGrid
class VeGiantModule(PipelineModule):
def __init__(self,
layers,
num_stages=None,
loss_fn=None,
seed_layers=False,
seed_fn=None,
base_seed=1234,
grid=None,
partition_method='parameters',
activation_checkpoint_interval=0,
activation_checkpoint_func=checkpointing.checkpoint):
"""Modules to be parallelized with pipeline parallelism.
The key constraint that enables pipeline parallelism is the
representation of the forward pass as a sequence of layers
and the enforcement of a simple interface between them. The
forward pass is implicitly defined by the module ``layers``. The key
assumption is that the output of each layer can be directly fed as
input to the next, like a ``torch.nn.Sequential``. The forward pass is
implicitly:
.. code-block:: python
def forward(self, inputs):
x = inputs
for layer in self.layers:
x = layer(x)
return x
Args:
layers (Iterable): A sequence of layers defining pipeline structure. Can be a ``torch.nn.Sequential`` module.
num_stages (int, optional): The degree of pipeline parallelism. If not specified, ``topology`` must be provided.
grid (``PipelineParallelGrid``, optional): Provides the process topology that defines the axes of parallelism for training. Must be provided if ``num_stages`` is ``None``.
loss_fn (callable, optional): Loss is computed ``loss = loss_fn(outputs, label)``
base_seed (int, optional): Base random seed used when seeding layers. Defaults to 1234.
partition_method (str, optional): Strategy for assigning layers to pipeline stages ('uniform', 'parameters', 'type:<regex>', or 'manual:<num_layers>:<boundaries>'). Defaults to 'parameters'.
activation_checkpoint_interval (int, optional): The granularity of activation checkpointing, in number of layers. 0 disables activation checkpointing.
activation_checkpoint_func (callable, optional): The function to use for activation checkpointing. Defaults to ``deepspeed.checkpointing.checkpoint``.
"""
super(PipelineModule, self).__init__()
topology = grid.topology() if grid is not None else None
if num_stages is None and topology is None:
raise RuntimeError('must provide num_stages or topology')
self.micro_offset = 0
self.loss_fn = loss_fn
self.seed_layers = seed_layers
self.seed_fn = seed_fn
self.base_seed = base_seed
if dist.get_rank() == 0:
try:
seed_str = self.seed_fn.__name__
except AttributeError:
seed_str = None
print(
f'SEED_LAYERS={self.seed_layers} BASE_SEED={self.base_seed} SEED_FN={seed_str}'
)
# Setup world info
self.world_group = dist.new_group(ranks=range(dist.get_world_size()))
self.global_rank = dist.get_rank(group=self.world_group)
self.world_size = dist.get_world_size(group=self.world_group)
if topology:
self._topo = topology
self.num_stages = self._topo.get_dim('pipe')
else:
self.num_stages = num_stages
if topology is None:
if self.world_size % self.num_stages != 0:
raise RuntimeError(
f'num_stages ({self.num_stages}) must divide distributed world size ({self.world_size})'
)
dp = self.world_size // num_stages
topology = PipeDataParallelTopology(num_pp=num_stages, num_dp=dp)
self._topo = topology
# Construct communicators for pipeline topology
self._grid = grid if grid is not None else PipelineParallelGrid(process_group=self.world_group, topology=self._topo)
self.stage_id = self._topo.get_coord(self.global_rank).pipe
# Initialize partition information
self._layer_specs = list(layers)
self._num_layers = len(self._layer_specs)
self._local_start = 0
self._local_stop = None
self._partition_layers(method=partition_method)
self.forward_funcs = []
self.tied_modules = nn.ModuleDict()
self.tied_weight_attrs = {}
# Offset the random seed by the stage ID.
#newseed = torch.cuda.initial_seed() + self._grid.get_stage_id()
#ds_utils.set_random_seed(newseed)
#with torch.random.fork_rng(devices=[torch.cuda.current_device()]):
self._build()
self.to('cuda')
self.tied_comms = self._index_tied_modules()
self._synchronize_tied_weights()
self.activation_checkpoint_interval = activation_checkpoint_interval
self.activation_checkpoint_func = activation_checkpoint_func
def _build(self):
specs = self._layer_specs
for local_idx, layer in enumerate(specs[self._local_start:self._local_stop]):
layer_idx = local_idx + self._local_start
if self.seed_layers:
if self.seed_fn:
self.seed_fn(self.base_seed + layer_idx)
else:
ds_utils.set_random_seed(self.base_seed + layer_idx)
# Recursively build PipelineModule objects
if isinstance(layer, PipelineModule):
raise NotImplementedError('RECURSIVE BUILD NOT YET IMPLEMENTED')
# LayerSpec objects contain an nn.Module that should be allocated now.
elif isinstance(layer, nn.Module):
name = str(layer_idx)
self.forward_funcs.append(layer)
self.add_module(name, layer)
# TiedLayerSpec objects contain an nn.Module that should be allocated now.
elif isinstance(layer, TiedLayerSpec):
# Build and register the module if we haven't seen it before.
if layer.key not in self.tied_modules:
self.tied_modules[layer.key] = layer.build()
self.tied_weight_attrs[layer.key] = layer.tied_weight_attr
if layer.forward_fn is None:
# Just use forward()
self.forward_funcs.append(self.tied_modules[layer.key])
else:
# User specified fn with args (module, input)
self.forward_funcs.append(
partial(layer.forward_fn,
self.tied_modules[layer.key]))
# LayerSpec objects contain an nn.Module that should be allocated now.
elif isinstance(layer, LayerSpec):
module = layer.build()
name = str(layer_idx)
self.forward_funcs.append(module)
self.add_module(name, module)
# Last option: layer may be a functional (e.g., lambda). We do nothing in
# that case and just use it in forward()
else:
self.forward_funcs.append(layer)
# All pipeline parameters should be considered as model parallel in the context
# of our FP16 optimizer
for p in self.parameters():
p.model_parallel = True
def _count_layer_params(self):
"""Count the trainable parameters in individual layers.
This routine will only build one layer at a time.
Returns:
A list of the number of parameters in each layer.
"""
param_counts = [0] * len(self._layer_specs)
for idx, layer in enumerate(self._layer_specs):
if isinstance(layer, LayerSpec):
l = layer.build()
params = filter(lambda p: p.requires_grad, l.parameters())
param_counts[idx] = sum(p.numel() for p in params)
elif isinstance(layer, nn.Module):
params = filter(lambda p: p.requires_grad, layer.parameters())
param_counts[idx] = sum(p.numel() for p in params)
return param_counts
def _find_layer_type(self, layername):
idxs = []
typeregex = regex.compile(layername, regex.IGNORECASE)
for idx, layer in enumerate(self._layer_specs):
name = None
if isinstance(layer, LayerSpec):
name = layer.typename.__name__
elif isinstance(layer, nn.Module):
name = layer.__class__.__name__
else:
try:
name = layer.__name__
except AttributeError:
continue
if typeregex.search(name):
idxs.append(idx)
if len(idxs) == 0:
raise RuntimeError(
f"Partitioning '{layername}' found no valid layers to partition.")
return idxs
def forward(self, forward_input):
# We need to offset the seed by the microbatch ID. Save it in a local var to
# ensure it is preserved in the closure. Otherwise checkpointed forward funcs
# will see a different offset.
self.micro_offset += 1
def exec_range_func(start, end):
''' Helper function to be used with checkpoint()
Adapted from torch.utils.checkpoint:checkpoint_sequential()
'''
local_micro_offset = self.micro_offset + 1
def exec_func(*inputs):
# Single tensor inputs need to be unwrapped
if len(inputs) == 1:
inputs = inputs[0]
for idx, layer in enumerate(self.forward_funcs[start:end]):
self.curr_layer = idx + self._local_start
if self.seed_layers:
new_seed = (self.base_seed *
local_micro_offset) + self.curr_layer
if self.seed_fn:
self.seed_fn(new_seed)
else:
ds_utils.set_random_seed(new_seed)
inputs = layer(inputs)
return inputs
return exec_func
if self.activation_checkpoint_interval == 0:
func = exec_range_func(0, len(self.forward_funcs))
x = func(forward_input)
else:
num_layers = len(self.forward_funcs)
x = forward_input
for start_idx in range(0, num_layers, self.activation_checkpoint_interval):
end_idx = min(start_idx + self.activation_checkpoint_interval,
num_layers)
funcs = self.forward_funcs[start_idx:end_idx]
# Since we either pass tensors or tuples of tensors without unpacking, we
# need to be careful not to double-wrap tensors with tuple.
if not isinstance(x, tuple):
x = (x, )
if self._is_checkpointable(funcs):
x = self.activation_checkpoint_func(
exec_range_func(start_idx,
end_idx),
*x)
else:
x = exec_range_func(start_idx, end_idx)(*x)
return x
def _partition_uniform(self, num_items, num_parts):
# print(f'enter _partition_uniform', flush=True)
parts = [0] * (num_parts + 1)
if num_items <= num_parts:
for p in range(num_parts + 1):
parts[p] = min(p, num_items)
return parts
expected_chunksize = num_items / num_parts
for p in range(num_parts):
parts[p] = min(floor(expected_chunksize * p), num_items)
parts[num_parts] = num_items
return parts
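# Worked example of the uniform partitioner above: 10 layers over 4 stages gives
# expected_chunksize = 2.5 and parts = [0, 2, 5, 7, 10], i.e. stages own
# layers [0:2], [2:5], [5:7], [7:10].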
def _partition_balanced(self, weights, num_parts, eps=1e-3):
num_items = len(weights)
# First check for the trivial edge case
if num_items <= num_parts:
return self._partition_uniform(num_items, num_parts)
weights_ = ds_utils.prefix_sum_inc(weights)
# Find the smallest bottleneck (weight of heaviest partition)
bottleneck = ds_utils._rb_partition_balanced(weights_, num_parts, eps=eps)
# Now compute that partitioning
parts, success = ds_utils._lprobe(weights_, num_parts, bottleneck)
assert success
return parts
def _partition_layers(self, method='uniform'):
num_stages = self._topo.get_dim('pipe')
stage_id = self._topo.get_coord(self.global_rank).pipe
if self.global_rank == 0:
logger.info(f'Partitioning pipeline stages with method {method}')
method = method.lower()
# Each stage gets a simple uniform number of layers.
if method == 'uniform':
num_layers = len(self._layer_specs)
self.parts = self._partition_uniform(num_items=num_layers,
num_parts=num_stages)
elif method == 'parameters':
param_counts = self._count_layer_params()
self.parts = self._partition_balanced(weights=param_counts,
num_parts=num_stages)
elif method.startswith('type:'):
layertype = method.split(':')[1]
binary_weights = [0] * len(self._layer_specs)
for idx in self._find_layer_type(layertype):
binary_weights[idx] = 1
else:
self.parts = self._partition_balanced(weights=binary_weights,
num_parts=num_stages)
elif method.startswith('manual:'):
msplit = method.split(':')
layernum = int(msplit[1])
layerparts = msplit[2].split(',')
assert len(self._layer_specs) == layernum # failsafe check for layer num
assert num_stages == len(layerparts)-1 # failsafe check for num stages
self.parts = list(map(int, layerparts))
elif method == 'profile':
raise NotImplementedError(f'Partitioning method {method} not implemented.')
else:
raise NotImplementedError(f'Partitioning method {method} not implemented.')
# Print some information on the partitioning.
if self.global_rank == 0:
for stage in range(num_stages):
start = self.parts[stage]
stop = self.parts[stage + 1]
print(f'stage={stage} layers={stop - start}')
for idx, layer in enumerate(self._layer_specs[start:stop]):
name = str(layer)
if isinstance(layer, LayerSpec):
name = layer.typename.__name__
if isinstance(layer, nn.Module):
name = layer.__class__.__name__
else:
try:
name = layer.__name__
except AttributeError:
pass
print(f' {idx+start:2d}: {name}')
if self.loss_fn:
try:
print(f' loss: {self.loss_fn.__name__}')
except AttributeError:
print(f' loss: {self.loss_fn.__class__.__name__}')
self._set_bounds(start=self.parts[stage_id], stop=self.parts[stage_id + 1])
def allreduce_tied_weight_gradients(self):
'''All reduce the gradients of the tied weights between tied stages'''
for key, comm in self.tied_comms.items():
weight = getattr(self.tied_modules[key], comm['weight_attr'])
dist.all_reduce(weight.grad, group=comm['group'])
def _synchronize_tied_weights(self):
for key, comm in self.tied_comms.items():
dist.broadcast(
getattr(comm['module'],
comm['weight_attr']),
src=min(comm['ranks']),
group=comm['group'],
)
def _index_tied_modules(self):
''' Build communication structures for tied modules. '''
tied_comms = {}
if self._topo.get_dim('pipe') == 1:
return tied_comms
specs = self._layer_specs
tie_keys = set(s.key for s in specs if isinstance(s, TiedLayerSpec))
for key in tie_keys:
# Find the layers that the tied module appears in
tied_layers = []
for idx, layer in enumerate(specs):
if isinstance(layer, TiedLayerSpec) and layer.key == key:
tied_layers.append(idx)
# Find all stages with this tied module
# TODO: Would be nice to remove the nested data/model parallelism loops and
# TODO: instead generalize in some way, since we really just care about the
# TODO: stage that owns the tied layer. Then loop over each (dp, mp, ...)
# TODO: fiber to generate process groups.
tied_stages = set(self.stage_owner(idx) for idx in tied_layers)
for dp in range(self._grid.data_parallel_size):
for mp in range(self._grid.model_parallel_size):
tied_ranks = []
for s in sorted(tied_stages):
if self._grid.model_parallel_size > 1:
tied_ranks.append(
self._grid.stage_to_global(stage_id=s,
data=dp,
model=mp))
else:
tied_ranks.append(
self._grid.stage_to_global(stage_id=s,
data=dp))
group = dist.new_group(ranks=tied_ranks)
# Record this tied module if we own a local copy of it.
if self.global_rank in tied_ranks:
assert key in self.tied_modules
if key in self.tied_modules:
tied_comms[key] = {
'ranks': tied_ranks,
'group': group,
'weight_attr': self.tied_weight_attrs[key],
'module': self.tied_modules[key],
}
# Only count the tied module once in the eyes of the FP16 optimizer
if self.global_rank != tied_ranks[0]:
for p in self.tied_modules[key].parameters():
p.model_parallel = False
'''
if len(tied_comms) > 0:
print(f'RANK={self.global_rank} tied_comms={tied_comms}')
'''
return tied_comms
def partitions(self):
return self.parts
def stage_owner(self, layer_idx):
assert 0 <= layer_idx < self._num_layers
for stage in range(self._topo.get_dim('pipe')):
if self.parts[stage] <= layer_idx < self.parts[stage + 1]:
return stage
raise RuntimeError(f'Layer {layer_idx} not owned? parts={self.parts}')
def _set_bounds(self, start=None, stop=None):
"""Manually define the range of layers that will be built on this process.
These boundaries are treated as list slices and so start is inclusive and stop is
exclusive. The default of None for both results in all layers being built
locally.
"""
self._local_start = start
self._local_stop = stop
def set_checkpoint_interval(self, interval):
assert interval >= 0
self.checkpoint_interval = interval
def topology(self):
""" ProcessTopology object to query process mappings. """
return self._topo
def mpu(self):
return self._grid
def num_pipeline_stages(self):
return self._topo.get_dim('pipe')
def ckpt_prefix(self, checkpoints_path, tag):
"""Build a prefix for all checkpoint files written by this module. """
# All checkpoint files start with this
rank_name = 'module'
# Data parallelism is omitted from the naming convention because we are agnostic
# to this in the checkpoint.
omit_dims = frozenset(['data'])
axes = [a for a in self._grid._topo.get_axis_names() if a not in omit_dims]
for dim in axes:
rank = getattr(self._grid._topo.get_coord(rank=self.global_rank), dim)
rank_name += f'-{dim}_{rank:02d}'
ckpt_name = os.path.join(checkpoints_path, str(tag), rank_name)
return ckpt_name
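# For illustration only (assuming a topology with 'pipe', 'data', and 'model' axes, in
# whatever order get_axis_names() returns them): a rank at pipe=0, model=1 saving with
# tag 'global_step100' would get a prefix like
#   <checkpoints_path>/global_step100/module-pipe_00-model_01
# since the 'data' axis is omitted from checkpoint names.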
def ckpt_layer_path(self, ckpt_dir, local_layer_idx):
"""Customize a prefix for a specific pipeline module layer. """
idx = local_layer_idx + self._local_start
layer_ckpt_path = os.path.join(ckpt_dir, f'layer_{idx:02d}')
rank_repr = self._grid._topo.get_rank_repr(rank=self.global_rank)
if rank_repr != '':
layer_ckpt_path += f'-{rank_repr}'
layer_ckpt_path += '-model_states.pt'
return layer_ckpt_path
def save_state_dict(self, save_dir):
if self._grid.data_parallel_id != 0:
return
os.makedirs(save_dir, exist_ok=True)
layer_offset = self._local_start
for idx, layer in enumerate(self.forward_funcs):
model_ckpt_path = self.ckpt_layer_path(save_dir, idx)
if not hasattr(layer, 'state_dict'):
continue
torch.save(layer.state_dict(), model_ckpt_path)
def load_state_dir(self, load_dir, strict=True):
rank = dist.get_rank()
layer_offset = self._local_start
for idx, layer in enumerate(self.forward_funcs):
# Functions, etc. will not have state_dicts
if not hasattr(layer, 'load_state_dict'):
continue
model_ckpt_path = self.ckpt_layer_path(load_dir, idx)
layer.load_state_dict(torch.load(model_ckpt_path,
map_location=lambda storage,
loc: storage),
strict=strict)
if self._grid.data_parallel_id == 0:
logger.info(
f'RANK={self.global_rank} Loaded layer={idx+layer_offset} file={model_ckpt_path}'
)
self._synchronize_tied_weights()
def _is_checkpointable(self, funcs):
if self.__class__.__name__ == 'GPT2ModelPipe':
return all('ParallelTransformerLayerPipe' in f.__class__.__name__
for f in funcs)
params = [f.parameters() for f in funcs if isinstance(f, torch.nn.Module)]
return any(len(list(p)) > 0 for p in params)
================================================
FILE: src/veGiantModel/engine/p2p.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright 2019 The Microsoft DeepSpeed Team
'''
Copyright 2019 The Microsoft DeepSpeed Team
'''
import os
import torch
import torch.distributed as dist
from deepspeed.utils import logger, log_dist
ENABLE_PYTORCH_BROADCAST = os.environ.get("ENABLE_PYTORCH_BROADCAST", "0") != "0"
try:
if not ENABLE_PYTORCH_BROADCAST:
import byteps.torch as bps
else:
print("BytePS import is disabled", flush=True)
bps = None
except ImportError:
print("BytePS is not installed")
bps = None
_groups = None
_grid = None
DS_PIPE_VERBOSE = os.environ.get('DS_PIPE_VERBOSE', "0") != "0"
did_recv = False
send_stream = None
recv_stream = None
bps_send_handles = {}
bps_recv_handles = {}
# Initializes adjacent process groups.
# Run this only after torch.distributed.init_process_group() has been called.
def init_process_groups(grid):
global _groups, _grid
_grid = grid
assert _grid.pipe_parallel_size > 1, "There is no pipeline parallelism"
_groups = [dist.new_group(ranks=group) for group in _grid.p2p_groups]
def _is_valid_send_recv(src_stage, dest_stage):
first_stage = 0
last_stage = _grid.pipe_parallel_size - 1
assert abs(src_stage-dest_stage) == 1 or \
(src_stage == first_stage and dest_stage == last_stage) or \
(src_stage == last_stage and dest_stage == first_stage), \
"Functionality currently limited to send and receive between adjacent ranks only"
def send(tensor, dest_stage, async_op=False):
global _groups
async_op = False
src_stage = _grid.get_stage_id()
_is_valid_send_recv(src_stage, dest_stage)
group = _get_send_recv_group(src_stage, dest_stage)
src_rank = _grid.stage_to_global(stage_id=src_stage)
import torch
if tensor.dtype != torch.float32 and DS_PIPE_VERBOSE:
print('warning: p2p send', tensor.dtype, tensor.shape, flush=True)
return _send(tensor, src_rank, group, async_op)
def _bps_get_name(src, dest, name, suffix):
return "_".join([str(src), str(dest), str(name), str(suffix)])
def bps_send(tensor, dest_stage, name, index, async_op=True):
global bps_send_handles
src_stage = _grid.get_stage_id()
_is_valid_send_recv(src_stage, dest_stage)
src_rank = _grid.stage_to_global(stage_id=src_stage)
dest_rank = _grid.stage_to_global(stage_id=dest_stage)
name = _bps_get_name(src_rank, dest_rank, name, index)
if name not in bps_send_handles:
# XXX hard-code max number of tensors for this name
bps_send_handles[name] = [None] * 10
else:
handle = bps_send_handles[name][index]
if handle is not None:
bps.synchronize(handle)
handle = bps.send_async(tensor, dest_rank, name=name)
# XXX
if not async_op:
bps.synchronize(handle)
else:
bps_send_handles[name][index] = handle
return tensor
def bps_sync(src_stage, name, index=0):
dest_stage = _grid.get_stage_id()
_is_valid_send_recv(src_stage, dest_stage)
src_rank = _grid.stage_to_global(stage_id=src_stage)
dest_rank = _grid.stage_to_global(stage_id=dest_stage)
name = _bps_get_name(src_rank, dest_rank, name, index)
if name in bps_recv_handles:
handle = bps_recv_handles[name][index]
if handle is not None:
bps.synchronize(handle)
def bps_sync_all():
for name, handles in bps_send_handles.items():
for handle in handles:
if handle is not None:
bps.synchronize(handle)
for name, handles in bps_recv_handles.items():
for handle in handles:
if handle is not None:
bps.synchronize(handle)
def bps_recv(tensor, src_stage, name, index=0, async_op=True):
global bps_recv_handles
dest_stage = _grid.get_stage_id()
_is_valid_send_recv(src_stage, dest_stage)
src_rank = _grid.stage_to_global(stage_id=src_stage)
dest_rank = _grid.stage_to_global(stage_id=dest_stage)
name = _bps_get_name(src_rank, dest_rank, name, index)
if name not in bps_recv_handles:
# XXX hard-code max number of tensors for this name
bps_recv_handles[name] = [None] * 10
else:
handle = bps_recv_handles[name][index]
if handle is not None:
bps.synchronize(handle)
handle = bps.recv_async(tensor, src_rank, name=name)
if not async_op:
bps.synchronize(handle)
else:
bps_recv_handles[name][index] = handle
return tensor
def _send(tensor, src_rank, group, async_op):
global did_recv
return dist.broadcast(tensor, src_rank, group=group, async_op=async_op)
def send_grads(tensor, grid, async_op=False):
async_op = False
if grid.send_grads_src_rank == grid.global_rank:
# print(f'start rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _send_grad_src_rank: {grid.send_grads_src_rank}, send group: {grid.send_grads_group}, send_grad_groups: {grid.send_grads_proc_group}', flush=True)
_send(tensor, grid.send_grads_src_rank, grid.send_grads_proc_group, async_op)
# print(f'finis rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _send_grad_src_rank: {grid.send_grads_src_rank}, send group: {grid.send_grads_group}', flush=True)
else:
# print(f'finish fast rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _send_grad_src_rank: {grid.send_grads_src_rank}, send group: {grid.send_grads_group}', flush=True)
pass
def _recv(tensor, src_rank, group, async_op):
global did_recv
tensor = dist.broadcast(tensor, src_rank, group=group, async_op=async_op)
did_recv = True
return tensor
def recv_grads(tensor, grid, async_op=False):
async_op = False
# print(f'start rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _recv_grad_src_rank: {grid.recv_grads_src_rank}, recv group: {grid.recv_grads_group}, recv_grad_groups: {grid.recv_grads_proc_group}', flush=True)
_recv(tensor, grid.recv_grads_src_rank, grid.recv_grads_proc_group, async_op)
# print(f'finish rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _recv_grad_src_rank: {grid.recv_grads_src_rank}, recv group: {grid.recv_grads_group}', flush=True)
def send_activations(tensor, grid, async_op=False):
async_op = False
if grid.send_activation_src_rank == grid.global_rank:
# print(f'start rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _send_grad_src_rank: {grid.send_grads_src_rank}, send group: {grid.send_grads_group}, send_grad_groups: {grid.send_grads_proc_group}', flush=True)
_send(tensor, grid.send_activation_src_rank, grid.send_activation_proc_group, async_op)
# print(f'finis rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _send_grad_src_rank: {grid.send_grads_src_rank}, send group: {grid.send_grads_group}', flush=True)
else:
# print(f'finish fast rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _send_grad_src_rank: {grid.send_grads_src_rank}, send group: {grid.send_grads_group}', flush=True)
pass
def recv_activations(tensor, grid, async_op=False):
async_op = False
# print(f'start rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _recv_grad_src_rank: {grid.recv_grads_src_rank}, recv group: {grid.recv_grads_group}, recv_grad_groups: {grid.recv_grads_proc_group}', flush=True)
_recv(tensor, grid.recv_activation_src_rank, grid.recv_activation_proc_group, async_op)
# print(f'finish rank: {grid.global_rank}, stage_id: {grid.stage_id}, mp_id: {grid.model_parallel_id}, _recv_grad_src_rank: {grid.recv_grads_src_rank}, recv group: {grid.recv_grads_group}', flush=True)
def recv(tensor, src_stage, async_op=False):
global _groups
global did_recv
async_op = False
dest_stage = _grid.get_stage_id()
_is_valid_send_recv(src_stage, dest_stage)
group = _get_send_recv_group(src_stage, dest_stage)
src_rank = _grid.stage_to_global(stage_id=src_stage)
return _recv(tensor, src_rank, group, async_op)
def barrier(stage_id):
global _groups, _grid
group_id = _grid.stage_to_global(stage_id=stage_id)
if (dist.get_rank() >= 0):
print("Barrier Group ID", group_id)
print("Barrier Group", _grid.p2p_groups[group_id])
dist.barrier(group=_groups[group_id])
if (dist.get_rank() >= 0):
print("Exiting Barrier ", group_id)
def _get_send_recv_group(src_stage, dest_stage):
'''The group id is the smaller stage id unless it's a wrap-around.'''
stage_id = None
first_stage = 0
last_stage = _grid.pipe_parallel_size - 1
if (src_stage == first_stage and dest_stage == last_stage
or dest_stage == first_stage and src_stage == last_stage):
stage_id = last_stage
elif src_stage > dest_stage:
stage_id = dest_stage
else:
stage_id = src_stage
'''group_id corresponds to the group [group_id, group_id+1],
unless group_id is the rank of the last stage,
in which case group_id corresponds to the group [group_id-num_stages+1, group_id].
'''
group_id = _grid.stage_to_global(stage_id=stage_id)
return _groups[group_id]
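# --- Illustrative sketch (editorial addition, not part of the original source). ---
# send()/recv() above emulate point-to-point transfers with dist.broadcast()
# over the 2-rank groups returned by _get_send_recv_group(). A minimal
# 2-process demo of that pattern (assumes torch.distributed is launched with
# WORLD_SIZE=2, e.g. via torchrun --nproc_per_node=2):
if __name__ == "__main__":
    dist.init_process_group(backend="gloo")
    rank = dist.get_rank()
    demo_group = dist.new_group(ranks=[0, 1])  # analogue of one entry in _groups
    t = torch.full((4,), float(rank))
    # Rank 0 "sends": every member of the group calls broadcast with src=0,
    # so rank 1 "receives" rank 0's tensor in place.
    dist.broadcast(t, src=0, group=demo_group)
    print(f"rank {rank} now holds {t.tolist()}", flush=True)
    dist.destroy_process_group()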
================================================
FILE: src/veGiantModel/engine/schedule.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
from deepspeed.runtime.pipe.schedule import (
BufferOpInstruction,PipeInstruction,
ReduceTiedGrads,ReduceGrads,OptimizerStep,
LoadMicroBatch,PipeSchedule,TrainSchedule,
)
import os
BYTEPS_REDUCED_MEM = os.environ.get('BYTEPS_REDUCED_MEM', '1') != '0'
class BytePSInferenceSchedule(PipeSchedule):
"""A schedule for inferencing batches using pipeline parallelism.
"""
def __init__(self, micro_batches, stages, stage_id, prefetch=True):
super().__init__(micro_batches, stages, stage_id)
self.prefetch = prefetch
def steps(self):
""""""
total_steps = self.micro_batches + self.stages - 1
for step_id in range(total_steps):
cmds = []
micro_batch_id = step_id - self.stage_id
buffer_id = micro_batch_id % self.num_pipe_buffers()
batch_is_valid = self._valid_micro_batch(micro_batch_id)
if not self.prefetch:
if batch_is_valid:
if self.is_first_stage or self.is_last_stage:
cmds.append(LoadMicroBatch(buffer_id))
if self._valid_stage(self.prev_stage):
cmds.append(BytePSRecvActivation(buffer_id))
cmds.append(BytePSSyncActivation(buffer_id))
cmds.append(BytePSForwardPass(buffer_id))
if self._valid_stage(self.next_stage):
cmds.append(BytePSSendActivation(buffer_id))
else:
next_buffer_id = (micro_batch_id + 1) % self.num_pipe_buffers()
next_batch_is_valid = self._valid_micro_batch(micro_batch_id + 1)
# micro_batch starts at 0. Get the current batch, and start prefetching
if micro_batch_id == 0:
if self.is_first_stage or self.is_last_stage:
cmds.append(LoadMicroBatch(buffer_id))
if self._valid_stage(self.prev_stage):
cmds.append(BytePSRecvActivation(buffer_id))
if next_batch_is_valid:
cmds.append(BytePSRecvActivation(next_buffer_id))
cmds.append(BytePSSyncActivation(buffer_id))
cmds.append(BytePSForwardPass(buffer_id))
if self._valid_stage(self.next_stage):
cmds.append(BytePSSendActivation(buffer_id))
elif batch_is_valid:
# After micro_batch 0, we prefetch the next one,
# and wait for the current one
if self._valid_stage(self.prev_stage) and next_batch_is_valid:
cmds.append(BytePSRecvActivation(next_buffer_id))
if self.is_first_stage or self.is_last_stage:
cmds.append(LoadMicroBatch(buffer_id))
if self._valid_stage(self.prev_stage):
cmds.append(BytePSSyncActivation(buffer_id))
cmds.append(BytePSForwardPass(buffer_id))
if self._valid_stage(self.next_stage):
cmds.append(BytePSSendActivation(buffer_id))
yield cmds
def num_pipe_buffers(self):
"""Only `self.micro_batches` pipeline buffers are required for inferencing.
Returns:
``self.micro_batches``
"""
buffers = min(self.micro_batches, self.stages * 2)
if BYTEPS_REDUCED_MEM:
buffers = min(self.stages + 1, self.micro_batches)
return max(2, buffers)
class BytePSTrainSchedule(TrainSchedule):
"""A schedule for training a batch using hybrid parallelism.
Pipeline parallelism is extracted through gradient accumulation and thus
convergence follows that of a data parallel approach with the same batch
size.
"""
def __init__(self, micro_batches, stages, stage_id, prefetch=True):
super().__init__(micro_batches, stages, stage_id)
self.prefetch = prefetch and micro_batches > 1
if not self.prefetch:
print('BYTEPS NO PREFETCH STEPS', flush=True)
def steps(self):
if self.prefetch:
return self._steps()
else:
return self._steps_no_prefetch()
def _steps(self):
""""""
total_steps = 2 * (self.micro_batches + self.stages - 1)
for step_id in range(total_steps):
# Map the step of the pipeline to the micro-batch id and also whether it is a
# forward or backward pass step.
cmds = []
micro_batch_id, is_forward = self._step_to_micro_batch(step_id)
batch_is_valid = self._valid_micro_batch(micro_batch_id)
if not batch_is_valid:
if step_id == total_steps - 1:
cmds.append(BytePSSyncAll())
cmds.append(ReduceTiedGrads())
cmds.append(ReduceGrads())
cmds.append(OptimizerStep())
yield cmds
continue
curr_buffer = self._buffer_idx(micro_batch_id)
# try to find the next valid batch
next_step_id = step_id + 1
next_micro_batch_id, next_is_forward, next_batch_is_valid = None, None, None
while next_step_id < total_steps:
next_micro_batch_id, next_is_forward = self._step_to_micro_batch(next_step_id)
next_batch_is_valid = self._valid_micro_batch(next_micro_batch_id)
if next_batch_is_valid:
break
next_step_id += 1
next_buffer = None
if next_batch_is_valid:
next_buffer = self._buffer_idx(next_micro_batch_id)
if micro_batch_id == 0 and is_forward:
# first/last stage loads
if self.stage_id == 0 or self.stage_id == self.stages - 1:
cmds.append(LoadMicroBatch(curr_buffer))
# fetch
if self._valid_stage(self.prev_stage):
cmds.append(BytePSRecvActivation(curr_buffer))
# pre-fetch
if next_batch_is_valid:
if self._valid_stage(self.prev_stage) and next_is_forward:
cmds.append(BytePSRecvActivation(next_buffer))
if self._valid_stage(self.next_stage) and not next_is_forward:
cmds.append(BytePSRecvGrad(next_buffer))
# sync and compute
if self._valid_stage(self.prev_stage):
cmds.append(BytePSSyncActivation(curr_buffer))
cmds.append(BytePSForwardPass(curr_buffer))
if self._valid_stage(self.next_stage):
cmds.append(BytePSSendActivation(curr_buffer))
else:
# prefetch
if next_batch_is_valid:
if self._valid_stage(self.prev_stage) and next_is_forward:
cmds.append(BytePSRecvActivation(next_buffer))
if self._valid_stage(self.next_stage) and not next_is_forward:
cmds.append(BytePSRecvGrad(next_buffer))
if is_forward:
if self.stage_id == 0 or self.stage_id == self.stages - 1:
# First/last stage loads
cmds.append(LoadMicroBatch(curr_buffer))
if self._valid_stage(self.prev_stage):
cmds.append(BytePSSyncActivation(curr_buffer))
cmds.append(BytePSForwardPass(curr_buffer))
if self._valid_stage(self.next_stage):
cmds.append(BytePSSendActivation(curr_buffer))
else:
if self._valid_stage(self.next_stage):
cmds.append(BytePSSyncGrad(curr_buffer))
cmds.append(BytePSBackwardPass(curr_buffer))
if self._valid_stage(self.prev_stage):
cmds.append(BytePSSendGrad(curr_buffer))
# Model step at the end of the batch
if step_id == total_steps - 1:
cmds.append(BytePSSyncAll())
cmds.append(ReduceTiedGrads())
cmds.append(ReduceGrads())
cmds.append(OptimizerStep())
yield cmds
def _steps_no_prefetch(self):
""""""
total_steps = 2 * (self.micro_batches + self.stages - 1)
for step_id in range(total_steps):
# Map the step of the pipeline to the micro-batch id and also whether it is a
# forward or backward pass step.
cmds = []
micro_batch_id, is_forward = self._step_to_micro_batch(step_id)
batch_is_valid = self._valid_micro_batch(micro_batch_id)
if not batch_is_valid:
if step_id == total_steps - 1:
cmds.append(BytePSSyncAll())
cmds.append(ReduceTiedGrads())
cmds.append(ReduceGrads())
cmds.append(OptimizerStep())
yield cmds
continue
curr_buffer = self._buffer_idx(micro_batch_id)
if is_forward:
if self._valid_stage(self.prev_stage):
cmds.append(BytePSRecvActivation(curr_buffer))
cmds.append(BytePSSyncActivation(curr_buffer))
if self.stage_id == 0 or self.stage_id == self.stages - 1:
# First/last stage loads
cmds.append(LoadMicroBatch(curr_buffer))
cmds.append(BytePSForwardPass(curr_buffer))
if self._valid_stage(self.next_stage):
cmds.append(BytePSSendActivation(curr_buffer))
else:
if self._valid_stage(self.next_stage):
cmds.append(BytePSRecvGrad(curr_buffer))
cmds.append(BytePSSyncGrad(curr_buffer))
cmds.append(BytePSBackwardPass(curr_buffer))
if self._valid_stage(self.prev_stage):
cmds.append(BytePSSendGrad(curr_buffer))
# Model step at the end of the batch
if step_id == total_steps - 1:
cmds.append(BytePSSyncAll())
cmds.append(ReduceTiedGrads())
cmds.append(ReduceGrads())
cmds.append(OptimizerStep())
yield cmds
def num_pipe_buffers(self):
"""As many buffers as the distance from this stage to the last stage.
"""
buffers = min(self.micro_batches, self.stages * 2)
if BYTEPS_REDUCED_MEM:
buffers = min(self.stages + 1, self.micro_batches)
return max(2, buffers)
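# Worked example (editorial note): with stages=4 and micro_batches=8,
# BYTEPS_REDUCED_MEM keeps min(stages + 1, micro_batches) = 5 buffers per stage;
# with it disabled the count is min(micro_batches, 2 * stages) = 8. The
# max(2, ...) floor guarantees the prefetching schedule always has a "current"
# and a "next" buffer to alternate between.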
class BytePSSendActivation(BufferOpInstruction):
pass
class BytePSRecvActivation(BufferOpInstruction):
pass
class BytePSSyncActivation(BufferOpInstruction):
pass
class BytePSSyncGrad(BufferOpInstruction):
pass
class BytePSSendGrad(BufferOpInstruction):
pass
class BytePSRecvGrad(BufferOpInstruction):
pass
class BytePSForwardPass(BufferOpInstruction):
pass
class BytePSBackwardPass(BufferOpInstruction):
pass
class BytePSSyncAll(PipeInstruction):
pass
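# --- Illustrative sketch (editorial addition, not part of the original source). ---
# The schedules above are consumed one step at a time; each step yields the list
# of instructions a stage must execute. Assuming DeepSpeed is installed, the
# instruction stream for stage 0 of a 2-stage pipeline with 4 micro-batches can
# be inspected like this:
if __name__ == "__main__":
    sched = BytePSTrainSchedule(micro_batches=4, stages=2, stage_id=0)
    for step_id, cmds in enumerate(sched.steps()):
        print(step_id, [type(cmd).__name__ for cmd in cmds])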
================================================
FILE: src/veGiantModel/engine/topology.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright 2019 The Microsoft DeepSpeed Team
from deepspeed.utils import log_dist
import torch.distributed as dist
from collections import namedtuple
from itertools import product as cartesian_product
import logging, os
import torch
class ProcessTopology:
""" Manages the mapping of n-dimensional Cartesian coordinates to linear
indices. This mapping is used to map the rank of processes to the grid
for various forms of parallelism.
Each axis of the tensor is accessed by its name. The provided ordering
of the axes defines the layout of the topology. ProcessTopology uses a "row-major"
layout of the tensor axes, and so axes=['x', 'y'] would map coordinates (x,y) and
(x,y+1) to adjacent linear indices. If instead axes=['y', 'x'] was used, coordinates
(x,y) and (x+1,y) would be adjacent.
Some methods return ProcessCoord namedtuples.
"""
def __init__(self, axes, dims):
"""Create a mapping of n-dimensional tensor coordinates to linear indices.
Arguments:
axes (list): the names of the tensor axes
dims (list): the dimension (length) of each axis of the topology tensor
"""
self.axes = axes # names of each topology axis
self.dims = dims # length of each topology axis
# This is actually a class that lets us hash {'row':3, 'col':2} mappings
self.ProcessCoord = namedtuple('ProcessCoord', axes)
self.mapping = {}
ranges = [range(d) for d in dims]
# example: 1, (0,0,1)
for global_rank, coord in enumerate(cartesian_product(*ranges)):
key = {axis: coord[self.axes.index(axis)] for axis in self.axes}
key = self.ProcessCoord(**key)
# for example, {ProcessCoord(row=0, col=1) : 1}
self.mapping[key] = global_rank
def get_rank(self, **coord_kwargs):
"""Return the global rank of a process via its coordinates.
Coordinates are specified as kwargs. For example:
>>> X = ProcessTopology(axes=['x', 'y'], dims=[2,3])
>>> X.get_rank(x=0, y=1)
1
"""
if len(coord_kwargs) != len(self.axes):
raise ValueError('get_rank() does not support slices. Use filter_match().')
key = self.ProcessCoord(**coord_kwargs)
assert key in self.mapping, f'key {coord_kwargs} invalid'
return self.mapping[key]
def get_axis_names(self):
"""Return a list of the axis names in the ordering of the topology. """
return self.axes
def get_rank_repr(self,
rank,
omit_axes=['data',
'pipe'],
inner_sep='_',
outer_sep='-'):
"""Return a string representation of a rank.
This method is primarily used for checkpointing model data.
For example:
>>> topo = Topo(axes=['a', 'b'], dims=[2, 2])
>>> topo.get_rank_repr(rank=3)
'a_01-b_01'
>>> topo.get_rank_repr(rank=3, omit_axes=['a'])
'b_01'
Args:
rank (int): A rank in the topology.
omit_axes (list, optional): Axes that should not be in the representation. Defaults to ['data', 'pipe'].
inner_sep (str, optional): Separator between an axis name and its coordinate. Defaults to '_'.
outer_sep (str, optional): Separator between axis entries. Defaults to '-'.
Returns:
str: A string representation of the coordinate owned by ``rank``.
"""
omit_axes = frozenset(omit_axes)
axes = [a for a in self.get_axis_names() if a not in omit_axes]
names = []
for ax in axes:
ax_rank = getattr(self.get_coord(rank=rank), ax)
names.append(f'{ax}{inner_sep}{ax_rank:02d}')
return outer_sep.join(names)
def get_dim(self, axis):
"""Return the number of processes along the given axis.
For example:
>>> X = ProcessTopology(axes=['x', 'y'], dims=[2,3])
>>> X.get_dim('y')
3
"""
if axis not in self.axes:
return 0
return self.dims[self.axes.index(axis)]
def get_coord(self, rank):
"""Return the coordinate owned by a process rank.
The axes of the returned namedtuple can be directly accessed as members. For
example:
>>> X = ProcessTopology(axes=['x', 'y'], dims=[2,3])
>>> coord = X.get_coord(rank=1)
>>> coord.x
0
>>> coord.y
1
"""
for coord, idx in self.mapping.items():
if idx == rank:
return coord
raise ValueError(f'rank {rank} not found in topology.')
def get_axis_comm_lists(self, axis):
""" Construct lists suitable for a communicator group along axis ``axis``.
Example:
>>> topo = Topo(axes=['pipe', 'data', 'model'], dims=[2, 2, 2])
>>> topo.get_axis_comm_lists('pipe')
[
[0, 4], # data=0, model=0
[1, 5], # data=0, model=1
[2, 6], # data=1, model=0
[3, 7], # data=1, model=1
]
Returns:
A list of lists whose coordinates match in all axes *except* ``axis``.
"""
# We don't raise a RuntimeError here because returning an empty list allows us
# to write more generalized code for hybrid parallelisms.
if axis not in self.axes:
return []
# Grab all axes but `axis`
other_axes = [a for a in self.axes if a != axis]
lists = []
# Construct all combinations of coords with other_axes
ranges = [range(self.get_dim(a)) for a in other_axes]
for coord in cartesian_product(*ranges):
other_keys = {a: coord[other_axes.index(a)] for a in other_axes}
# now go over all ranks in `axis`.
sub_list = []
for axis_key in range(self.get_dim(axis)):
key = self.ProcessCoord(**other_keys, **{axis: axis_key})
sub_list.append(self.mapping[key])
lists.append(sub_list)
return lists
def filter_match(self, **filter_kwargs):
"""Return the list of ranks whose coordinates match the provided criteria.
Example:
>>> X = ProcessTopology(axes=['pipe', 'data', 'model'], dims=[2, 2, 2])
>>> X.filter_match(pipe=0, data=1)
[2, 3]
>>> [X.get_coord(rank) for rank in X.filter_match(pipe=0, data=1)]
[ProcessCoord(pipe=0, data=1, model=0), ProcessCoord(pipe=0, data=1, model=1)]
Arguments:
**filter_kwargs (dict): criteria used to select coordinates.
Returns:
The list of ranks whose coordinates match filter_kwargs.
"""
def _filter_helper(x):
for key, val in filter_kwargs.items():
if getattr(x, key) != val:
return False
return True
coords = filter(_filter_helper, self.mapping.keys())
return [self.mapping[coo] for coo in coords]
def get_axis_list(self, axis, idx):
"""Returns the list of global ranks whose coordinate in an axis is idx.
For example:
>>> X = ProcessTopology(axes=['x', 'y'], dims=[2,3])
>>> X.get_axis_list(axis='x', idx=0)
[0, 1, 2]
>>> X.get_axis_list(axis='y', idx=0)
[0, 3]
"""
# This could be faster by generating the desired keys directly instead of
# filtering.
axis_num = self.axes.index(axis)
ranks = [self.mapping[k] for k in self.mapping.keys() if k[axis_num] == idx]
return ranks
def world_size(self):
return len(self.mapping)
def __str__(self):
return str(self.mapping)
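# --- Illustrative sketch (editorial addition, not part of the original source). ---
# ProcessTopology is pure rank<->coordinate bookkeeping, so it can be exercised
# without torch.distributed being initialized:
if __name__ == "__main__":
    topo = ProcessTopology(axes=['pipe', 'data', 'model'], dims=[2, 2, 2])
    print(topo.get_rank(pipe=1, data=0, model=1))   # 5 (row-major ordering)
    print(topo.get_coord(rank=3))                   # ProcessCoord(pipe=0, data=1, model=1)
    print(topo.filter_match(pipe=0, data=1))        # [2, 3]
    print(topo.get_axis_list(axis='model', idx=0))  # [0, 2, 4, 6]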
def _prime_factors(N):
""" Returns the prime factorization of positive integer N. """
if N <= 0:
raise ValueError("Values must be strictly positive.")
primes = []
while N != 1:
for candidate in range(2, N + 1):
if N % candidate == 0:
primes.append(candidate)
N //= candidate
break
return primes
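# Example (editorial note): _prime_factors(12) == [2, 2, 3]. The default
# PipelineParallelGrid constructor below alternates these factors between the
# pipe and data axes, so a world size of 12 with no explicit topology becomes
# num_pp = 2 * 3 = 6 and num_dp = 2.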
class PipeDataParallelTopology(ProcessTopology):
""" A topology specialiation for hybrid data and pipeline parallelism.
Uses data parallelism on the last dimension to encourage gradient
reductions to use high-bandwidth intra-node links and lower-volume
pipeline communications to use low-bandwidth inter-node links.
"""
def __init__(self, num_pp, num_dp):
super().__init__(axes=['pipe', 'data'], dims=[num_pp, num_dp])
class PipeModelDataParallelTopology(ProcessTopology):
""" A topology for hybrid pipeline, model, and data parallelism. """
def __init__(self, num_dp, num_pp, num_mp):
# super().__init__(axes=['model', 'data', 'pipe'], dims=[num_mp, num_dp, num_pp])
super().__init__(axes=['pipe', 'data', 'model'], dims=[num_pp, num_dp, num_mp])
class PipelineParallelGrid:
"""Implements a grid object that stores the data parallel ranks
corresponding to each o the model parallel stages
The grid object organizes the processes in a distributed pytorch job
into a 2D grid, of stage_id and data_parallel_id.
self.stage_id and self.data_parallel_id stores the stage id
and the data parallel id of current process.
self.dp_group groups the processes by stage_id.
self.dp_group[i], is a list containing all process ranks whose
stage_id is i.
self.p2p_groups stores a list of tuple, where each tuple
stores process ranks of adjacent stages for a given data_parallel_id.
For example if num_stage is 5 then a tuple [7,8] represents stages [3, 4],
with data_parallel id = 1. A stage wrap around will appear as non-adjacent ranks,
for example tuple [4,0] with representing wrap-around stage 4 and 0, for
data_parallel_id = 0, or similarly [9,5] represents wrapped around stages [4,0]
for data_parallel_id = 1.
"""
def __init__(self, topology=None, process_group=None):
# TODO use process_group if provided
self.global_rank = dist.get_rank()
self.world_size = dist.get_world_size()
if topology is not None:
log_dist(f'building PipelineParallelGrid with topology: {topology}', ranks=[-1], level=logging.DEBUG)
self._topo = topology
else:
num_pp = 1
num_dp = 1
for idx, prime in enumerate(_prime_factors(self.world_size)):
if idx % 2 == 0:
num_pp *= prime
else:
num_dp *= prime
self._topo = PipeDataParallelTopology(num_dp=num_dp, num_pp=num_pp)
self.data_parallel_size = max(self._topo.get_dim('data'), 1)
self.pipe_parallel_size = max(self._topo.get_dim('pipe'), 1)
self.model_parallel_size = max(self._topo.get_dim('model'), 1)
assert self._is_grid_valid(), "Invalid Grid"
self.stage_id = self.get_stage_id()
self.data_parallel_id = self.get_data_parallel_id()
self.model_parallel_id = self.get_model_parallel_id()
self.slice_parallel_src_id = self.get_src_parallel_src_id()
log_dist(f'stage_id: {self.stage_id}, slice_parallel_src_id: {self.slice_parallel_src_id}', ranks=[-1], level=logging.DEBUG)
# Create new ProcessGroups for all model parallelism. DeepSpeedLight uses these
# to detect overflow, etc.
self.ds_model_proc_group = None
self.ds_model_rank = -1
for dp in range(self.data_parallel_size):
ranks = sorted(self._topo.get_axis_list(axis='data', idx=dp))
if self.global_rank == 0:
#print(f'RANK={self.global_rank} building DeepSpeed model group: {ranks}')
pass
proc_group = dist.new_group(ranks=ranks)
if self.global_rank in ranks:
log_dist(f'data_parallel_id: {self.data_parallel_id}, model_parallel_id: {self.model_parallel_id}, \
stage_id: {self.stage_id}, building ds model group: {ranks}', ranks=[-1], level=logging.DEBUG)
self.ds_model_proc_group = proc_group
self.ds_model_world_size = len(ranks)
self.ds_model_rank = ranks.index(self.global_rank)
assert self.ds_model_rank > -1
assert self.ds_model_proc_group is not None
# Create new ProcessGroup for gradient all-reduces - these are the data parallel groups
self.dp_group = []
self.dp_groups = self._topo.get_axis_comm_lists('data')
for g in self.dp_groups:
proc_group = dist.new_group(ranks=g)
if self.global_rank in g:
log_dist(f'data_parallel_id: {self.data_parallel_id}, model_parallel_id: {self.model_parallel_id}, \
stage_id: {self.stage_id}, building dp group: {g}', ranks=[-1], level=logging.DEBUG)
self.dp_group = g
self.dp_proc_group = proc_group
self.is_first_stage = (self.stage_id == 0)
self.is_last_stage = (self.stage_id == (self.pipe_parallel_size - 1))
self.p2p_groups = self._build_p2p_groups()
self._build_grads_groups()
self._build_activation_groups()
# Create new ProcessGroup for pipeline collectives - these are pipe parallel groups
self.pp_group = []
self.pp_proc_group = None
self.pipe_groups = self._topo.get_axis_comm_lists('pipe')
for ranks in self.pipe_groups:
# if self.global_rank == 0:
# #print(f'RANK={self.global_rank} building pipeline group: {ranks}')
# pass
proc_group = dist.new_group(ranks=ranks)
if self.global_rank in ranks:
log_dist(f'data_parallel_id: {self.data_parallel_id}, model_parallel_id: {self.model_parallel_id},\
stage_id: {self.stage_id}, building pipeline group: {ranks}', \
ranks=[-1], level=logging.DEBUG)
self.pp_group = ranks
self.pp_proc_group = proc_group
assert self.pp_proc_group is not None
# Create new ProcessGroup for model (tensor-slicing) collectives
# Short circuit case without model parallelism.
# TODO: it would be nice if topology had bcast semantics to avoid this branching
# case?
if self.model_parallel_size == 1:
for group_rank in range(self.world_size):
group_rank = [group_rank]
group = dist.new_group(ranks=group_rank)
if group_rank[0] == self.global_rank:
self.slice_group = group_rank
self.slice_proc_group = group
return
else:
self.mp_group = []
self.model_groups = self._topo.get_axis_comm_lists('model')
for g in self.model_groups:
proc_group = dist.new_group(ranks=g)
if self.global_rank in g:
log_dist(f'data_parallel_id: {self.data_parallel_id}, model_parallel_id: {self.model_parallel_id}, \
stage_id: {self.stage_id}, building slice group: {g}', ranks=[-1], level=logging.DEBUG)
self.slice_group = g
self.slice_proc_group = proc_group
def get_stage_id(self):
return self._topo.get_coord(rank=self.global_rank).pipe
def get_data_parallel_id(self):
return self._topo.get_coord(rank=self.global_rank).data
def get_model_parallel_id(self):
if 'model' in self._topo.get_axis_names():
return self._topo.get_coord(rank=self.global_rank).model
return 0
def get_src_parallel_src_id(self):
if 'model' not in self._topo.get_axis_names():
return 0
return self.stage_to_global(stage_id=self.stage_id,
data=self.data_parallel_id,
model=0)
def _build_p2p_groups(self):
"""Groups for sending and receiving activations and gradients across model
parallel stages.
"""
comm_lists = self._topo.get_axis_comm_lists('pipe')
log_dist(f'_build_p2p_groups data_parallel_id: {self.data_parallel_id}, \
model_parallel_id: {self.model_parallel_id}, stage_id: {self.stage_id}, \
comm_lists: {comm_lists}', ranks=[-1], level=logging.DEBUG)
p2p_lists = []
for rank in range(self.world_size):
for l in comm_lists:
assert len(l) == self.pipe_parallel_size
if rank in l:
idx = l.index(rank)
buddy_rank = l[(idx + 1) % self.pipe_parallel_size]
p2p_lists.append([rank, buddy_rank])
break # next global rank
assert len(p2p_lists) == self.world_size
log_dist(f'data_parallel_id: {self.data_parallel_id}, model_parallel_id: \
{self.model_parallel_id}, stage_id: {self.stage_id}, \
p2p_lists: {p2p_lists}', ranks=[-1], level=logging.DEBUG)
return p2p_lists
def _build_grads_groups(self):
self.send_grads_src_rank = -1
self.recv_grads_src_rank = -1
self.send_grads_group = []
self.recv_grads_group = []
self.send_grads_proc_group = None
self.recv_grads_proc_group = None
self.grads_proc_groups = []
for dp_id in range(self.data_parallel_size):
for stage in range(self.pipe_parallel_size):
next_stage = stage + 1
prev_stage = stage - 1
grads_group = []
grads_proc_group = None
if prev_stage > -1:
grads_src_rank = self._topo.filter_match(data=dp_id, pipe=stage, model=0)[0]
prev_mp_group = self._topo.filter_match(data=dp_id, pipe=prev_stage)
grads_group.append(grads_src_rank)
grads_group.extend(prev_mp_group)
grads_group.sort()
# log_dist(f'_build_grads_groups stage: {stage}, grads_group: {grads_group}', ranks=[-1])
grads_proc_group = dist.new_group(ranks=grads_group)
self.grads_proc_groups.append(grads_proc_group)
if stage == self.stage_id and self.data_parallel_id == dp_id:
self.send_grads_src_rank = grads_src_rank
self.send_grads_group = grads_group
self.send_grads_proc_group = grads_proc_group
elif stage == self.stage_id + 1 and self.data_parallel_id == dp_id:
self.recv_grads_src_rank = grads_src_rank
self.recv_grads_group = grads_group
self.recv_grads_proc_group = grads_proc_group
log_dist(f'_build_grads_groups stage: {self.stage_id}, send_grads_src_rank : {self.send_grads_src_rank}, '
f'send_grads_group: {self.send_grads_group}, recv_grads_group: {self.recv_grads_group}', \
ranks=[-1], level=logging.DEBUG)
def _build_activation_groups(self):
self.send_activation_src_rank = -1
self.recv_activation_src_rank = -1
self.send_activation_group = []
self.recv_activation_group = []
self.send_activation_proc_group = None
self.recv_activation_proc_group = None
self.activation_proc_groups = []
for dp_id in range(self.data_parallel_size):
for stage in range(self.pipe_parallel_size):
next_stage = stage + 1
prev_stage = stage - 1
activation_group = []
activation_proc_group = None
if next_stage < self.pipe_parallel_size:
activation_src_rank = self._topo.filter_match(data=dp_id, pipe=stage, model=0)[0]
next_mp_group = self._topo.filter_match(data=dp_id, pipe=next_stage)
activation_group.append(activation_src_rank)
activation_group.extend(next_mp_group)
activation_group.sort()
activation_proc_group = dist.new_group(ranks=activation_group)
self.activation_proc_groups.append(activation_proc_group)
if stage == self.stage_id and self.data_parallel_id == dp_id:
self.send_activation_src_rank = activation_src_rank
self.send_activation_group = activation_group
self.send_activation_proc_group = activation_proc_group
elif stage == self.stage_id - 1 and self.data_parallel_id == dp_id:
self.recv_activation_src_rank = activation_src_rank
self.recv_activation_group = activation_group
self.recv_activation_proc_group = activation_proc_group
log_dist(f'_build_activation_groups stage: {self.stage_id}, send_activation_src_rank : '\
f'{self.send_activation_src_rank}, send_activation_group: {self.send_activation_group}, '\
f'recv_grads_group: {self.recv_grads_group}', ranks=[-1], level=logging.DEBUG)
def _is_grid_valid(self):
ranks = 1
for ax in self._topo.get_axis_names():
ranks *= self._topo.get_dim(ax)
return ranks == dist.get_world_size()
#returns the global rank of the process with the provided stage id
#which has the same data_parallel_id as caller process
def stage_to_global(self, stage_id, **kwargs):
me = self._topo.get_coord(self.global_rank)
transform = me._replace(pipe=stage_id, **kwargs)._asdict()
return self._topo.get_rank(**transform)
#returns the byteps rank of the process with the provided stage id
def stage_to_byteps(self, stage_id):
return self.pipe_parallel_size * self.data_parallel_id + stage_id
def topology(self):
return self._topo
# MPU functions for DeepSpeed integration
def get_global_rank(self):
return self.global_rank
def get_pipe_parallel_rank(self):
""" The stage of the pipeline this rank resides in. """
return self.stage_id
def get_pipe_parallel_world_size(self):
""" The number of stages in the pipeline. """
return self.pipe_parallel_size
def get_pipe_parallel_group(self):
""" The group of ranks within the same pipeline. """
return self.pp_proc_group
def get_data_parallel_rank(self):
""" Which pipeline this rank resides in. """
return self.data_parallel_id
def get_data_parallel_world_size(self):
""" The number of pipelines. """
return self.data_parallel_size
def get_data_parallel_group(self):
""" The group of ranks within the same stage of all pipelines. """
return self.dp_proc_group
# These are model parallel groups across all types of model parallelism.
# Deepspeed uses them to detect overflow, etc.
def get_model_parallel_rank(self):
return self.model_parallel_id
def get_model_parallel_world_size(self):
return self.model_parallel_size
def get_model_parallel_group(self):
return self.slice_proc_group
# For Megatron-style tensor slicing
def get_slice_parallel_rank(self):
return self.model_parallel_id
def get_slice_parallel_world_size(self):
return self.model_parallel_size
def get_slice_parallel_group(self):
return self.slice_proc_group
def get_slice_parallel_src_rank(self):
return self.slice_parallel_src_id
================================================
FILE: src/veGiantModel/initialize.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
import torch
import os
import random
import numpy as np
from megatron.global_vars import set_global_variables
from megatron import get_args, mpu, print_rank_0
from .engine.topology import PipeModelDataParallelTopology, PipelineParallelGrid
from .launcher.launch import launch_bps
from deepspeed.utils import log_dist
import logging
def add_byte_giant_model_customize_args(parser):
import deepspeed
parser = deepspeed.add_config_arguments(parser)
group = parser.add_argument_group(title='bytedance')
group.add_argument('--cpu-optimizer', action='store_true',
help='Run optimizer on CPU')
group.add_argument('--cpu_torch_adam', action='store_true',
help='Use Torch Adam as optimizer on CPU.')
group.add_argument('--vocab-size', type=int, default=1000,
help='vocab size.')
group.add_argument('--train-batch-size', type=int, default=0,
help='global batch size')
group.add_argument('--train_micro_batch_size_per_gpu', type=int, default=0,
help='Batch size per model instance (for deepspeed). '
'Global batch size is local batch size times data '
'parallel size.')
group.add_argument('--deepspeed-activation-checkpointing', action='store_true',
help='deepspeed_activation_checkpointing.')
group.add_argument('--deepspeed-pipeline', action='store_true',
help='enable pipeline parallelism via deepspeed.')
group.add_argument('--ci', action='store_true', help="run in CI environment")
group.add_argument('--gradient_accumulation_steps', type=int, default=1,
help="set gradient_accumulation_steps for deepspeed config")
group.add_argument('--train_batch_size', type=int, default=0,
help="train_batch_size")
group.add_argument('--broadcast_activation', action='store_true', help="use broadcast to send/recv activation")
group.add_argument('--broadcast_grads', action='store_true', help="use broadcast to send/recv grads")
group.add_argument('--partition_method', type=str, default='uniform',
help='the method to partition layers in pipeline parallelism.')
group.add_argument('--config_param', type=str, default='',
help='json dict for deepspeed config')
group.add_argument('--num-stages', type=int, default=1,
help='number of stages')
return parser
def initialize_megatron(extra_args_provider=None, args_defaults={}):
set_global_variables(extra_args_provider=add_byte_giant_model_customize_args, args_defaults=args_defaults)
args = get_args()
init_distribute(args.num_stages, args.model_parallel_size)
_set_random_seed(args.seed)
def _init_topology(num_stages, mp_size):
num_pp = num_stages
num_mp = mp_size
num_dp = (torch.distributed.get_world_size() // num_pp) // num_mp
log_dist(f'init topology with num_pp: {num_pp}, num_mp: {num_mp}, \
num_dp: {num_dp}', ranks=[-1], level=logging.DEBUG)
topology = PipeModelDataParallelTopology(num_pp=num_pp, num_mp=num_mp, num_dp=num_dp)
log_dist(f'finish building topology, topology.mapping: {topology.mapping}', \
ranks=[-1], level=logging.DEBUG)
return PipelineParallelGrid(topology)
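# Worked example (editorial note): with a world size of 32, num_stages=4 and
# model_parallel_size=2, _init_topology() derives num_dp = (32 // 4) // 2 = 4
# and builds a PipeModelDataParallelTopology over axes ['pipe', 'data', 'model']
# with dims [4, 4, 2].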
def _set_random_seed(seed):
"""Set random seed for reproducability."""
if seed is not None and seed > 0:
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.device_count() > 0:
mpu.model_parallel_cuda_manual_seed(seed)
else:
raise ValueError('Seed ({}) should be a positive integer.'.format(seed))
def init_distribute(num_stages, mp_size,
distributed_backend='nccl', init_method='tcp://'):
rank = int(os.getenv('RANK', '0'))
world_size = int(os.getenv("WORLD_SIZE", '1'))
device_count = torch.cuda.device_count()
local_rank = rank % device_count
if torch.distributed.is_initialized():
print_rank_0('torch distributed is already initialized, '
'skipping initialization ...')
else:
print_rank_0('> initializing torch distributed ...')
torch.cuda.set_device(local_rank)
# Call the init process
master_ip = os.getenv('MASTER_ADDR', 'localhost')
master_port = os.getenv('MASTER_PORT', '6000')
init_method += master_ip + ':' + master_port
torch.distributed.init_process_group(
backend=distributed_backend,
world_size=world_size, rank=rank,
init_method=init_method)
# Set the model-parallel / data-parallel communicators.
grid = _init_topology(num_stages, mp_size)
mpu.initialize_model_parallel(grid)
if num_stages > 1:
import byteps.torch as bps
assert bps is not None
launch_bps(local_rank)
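# --- Illustrative sketch (editorial addition, not part of the original source). ---
# init_distribute() is driven by the RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT environment variables. A rough single-process dry run, assuming
# Megatron-LM is installed and one GPU is visible (num_stages=1, so BytePS is
# not launched), could look like:
if __name__ == "__main__":
    os.environ.setdefault('RANK', '0')
    os.environ.setdefault('WORLD_SIZE', '1')
    os.environ.setdefault('MASTER_ADDR', 'localhost')
    os.environ.setdefault('MASTER_PORT', '6000')
    init_distribute(num_stages=1, mp_size=1)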
================================================
FILE: src/veGiantModel/launcher/launch.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
#!/usr/bin/python
from __future__ import print_function
import os
import subprocess
import threading
import sys
from megatron import mpu
from deepspeed.utils import log_dist
import logging
class PropagatingThread(threading.Thread):
""" propagate exceptions to the parent's thread
refer to https://stackoverflow.com/a/31614591/9601110
"""
def run(self):
self.exc = None
try:
if hasattr(self, '_Thread__target'):
# python 2.x
self.ret = self._Thread__target(
*self._Thread__args, **self._Thread__kwargs)
else:
# python 3.x
self.ret = self._target(*self._args, **self._kwargs)
except BaseException as e:
self.exc = e
def join(self):
super(PropagatingThread, self).join()
if self.exc:
raise self.exc
return self.exc
def launch_scheduler(local_rank):
if os.environ['WORKER_RANK'] != '0':
return
if local_rank != 0:
return
def scheduler_runner():
my_env = os.environ.copy()
my_env['DMLC_ROLE'] = 'scheduler'
my_env['PS_VERBOSE'] = os.environ.get('PS_VERBOSE', '1')
nvidia_smi = f'nvidia-smi -L'
devices = os.popen(nvidia_smi).read().strip()
if 'A100' in devices:
ip_cmd = f'ip addr show eth2'
ip = os.popen(ip_cmd + ' | grep "\<inet\>" | awk \'{ print $2 }\' | awk -F "/" \'{ print $1 }\'').read().strip()
my_env['DMLC_NODE_HOST'] = ip
my_env['UCX_RDMA_CM_SOURCE_ADDRESS'] = ip
os.environ['UCX_NET_DEVICES'] = 'mlx5_2:1,eth0,eth1,eth2,eth3'
command = "python3 -c 'import byteps.server'"
subprocess.check_call(command, env=my_env,
stdout=sys.stdout, stderr=sys.stderr, shell=True)
t = PropagatingThread(target=scheduler_runner)
t.daemon = True
t.start()
def get_worker0_host():
host = os.environ['WORKER_0_HOST']
return host
def get_worker0_port():
port = os.environ['WORKER_0_PORT']
return port
def setup_env(local_rank):
mp_size = mpu.get_model_parallel_world_size()
num_nodes = int(os.environ['NUM_WORKER'])
gpu_per_node = int(os.environ['GPU_PER_WORKER'])
assert gpu_per_node >= mp_size
assert gpu_per_node % mp_size == 0
os.environ['BYTEPS_RDMA_START_DEPTH'] = str(32)
os.environ['BYTEPS_RDMA_RX_DEPTH'] = str(512)
os.environ['DMLC_NUM_WORKER'] = str(gpu_per_node * num_nodes)
os.environ['DMLC_NUM_SERVER'] = str(gpu_per_node * num_nodes)
os.environ['BYTEPS_LOCAL_SIZE'] = str(gpu_per_node)
os.environ['BYTEPS_FORCE_DISTRIBUTED'] = '1'
os.environ['BYTEPS_ENABLE_IPC'] = '0'
os.environ['DMLC_PS_ROOT_PORT'] = get_worker0_port()
os.environ['DMLC_PS_ROOT_URI'] = get_worker0_host()
if 'DMLC_ENABLE_RDMA' not in os.environ:
os.environ['DMLC_ENABLE_RDMA'] = '1'
os.environ['DMLC_ENABLE_UCX'] = os.environ.get('DMLC_ENABLE_UCX', '1')
os.environ['UCX_IB_TRAFFIC_CLASS'] = '236'
os.environ['UCX_TLS'] = os.environ.get('UCX_TLS', 'rc_x,tcp,sm')
nvidia_smi = f'nvidia-smi -L'
devices = os.popen(nvidia_smi).read().strip()
if 'A100' in devices:
nic = 2 # TODO: use multiple NICs with `int(local_rank / 2)`
ip_cmd = f'ip addr show eth{nic}'
ip = os.popen(ip_cmd + ' | grep "\<inet\>" | awk \'{ print $2 }\' | awk -F "/" \'{ print $1 }\'').read().strip()
os.environ['UCX_RDMA_CM_SOURCE_ADDRESS'] = os.environ.get('UCX_RDMA_CM_SOURCE_ADDRESS', ip)
devs = os.environ.get('UCX_NET_DEVICES', f'mlx5_{nic}:1,eth0,eth1,eth2,eth3')
os.environ['UCX_NET_DEVICES'] = devs
os.environ['DMLC_NODE_HOST'] = os.environ['UCX_RDMA_CM_SOURCE_ADDRESS']
elif 'V100' in devices or 'T4' in devices:
devs = os.environ.get('UCX_NET_DEVICES', 'mlx5_2:1,eth0,eth2')
os.environ['UCX_NET_DEVICES'] = devs
else:
raise RuntimeError(f"Unknown devices: {devices}")
def launch_bps(local_rank):
log_dist(f'launch_bps({local_rank})', ranks=[-1], level=logging.DEBUG)
setup_env(local_rank)
launch_scheduler(local_rank)
================================================
FILE: src/veGiantModel/module/__init__.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
from .dense import ColumnSerialLinear, ColumnParallelLinear
from .dense import RowSerialLinear, RowParallelLinear, MockModule
from .dense import ColumnParallelLinearTranspose, ColumnSerialLinearTranspose
__all__ = ['ColumnSerialLinear',
'ColumnParallelLinear',
'ColumnParallelLinearTranspose',
'ColumnSerialLinearTranspose',
'RowSerialLinear',
'RowParallelLinear',
'MockModule']
================================================
FILE: src/veGiantModel/module/dense.py
================================================
# Copyright (c) 2021, ByteDance Inc. All rights reserved.
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import torch
import torch.nn as nn
import torch.autograd as autograd
# try:
# import veGiantModel
# except ImportError:
# byteGiantModel = None
class MockModule(nn.Module):
"""Module for testing model parallelism"""
pass
try:
from th_fastertransformer import Linear
class LinearFunction(autograd.Function):
@staticmethod
def forward(ctx, input_tensor, weight, bias, act_gelu=False, dropout_rate=0.0):
bias_out = torch.Tensor(0)
dropout_mask = torch.Tensor(0)
if act_gelu == True or dropout_rate > 0.0:
================================================
SYMBOL INDEX (275 symbols across 14 files)
================================================
FILE: examples/gpt/gpt_piped.py
class GPTModelPiped (line 19) | class GPTModelPiped(VeGiantModule):
method __init__ (line 20) | def __init__(self):
method _get_batch (line 75) | def _get_batch(self, data):
method loss_fn (line 92) | def loss_fn(self, inputs, data):
method batch_fn (line 115) | def batch_fn(self, batch, is_train:bool):
class LMLogitsPiped (line 137) | class LMLogitsPiped(MegatronModule):
method __init__ (line 138) | def __init__(self, hidden_size, vocab_size, init_method):
method forward (line 144) | def forward(self, lm_output):
class EmbeddingPiped (line 148) | class EmbeddingPiped(Embedding):
method __init__ (line 149) | def __init__(self,
method forward (line 164) | def forward(self, inputs):
class ParallelTransformerLayerPiped (line 168) | class ParallelTransformerLayerPiped(ParallelTransformerLayer):
method __init__ (line 169) | def __init__(self,
method forward (line 179) | def forward(self, inputs):
FILE: examples/gpt/initialize.py
function get_learning_rate_scheduler (line 15) | def get_learning_rate_scheduler(optimizer, lr_scheduler_builder):
function get_model (line 45) | def get_model(model_provider_func):
function get_optimizer (line 63) | def get_optimizer(model):
function setup_model_and_optimizer (line 107) | def setup_model_and_optimizer(model, optimizer, train_dataset_provider, ...
function initialize_pipeline (line 155) | def initialize_pipeline(model, optimizer, train_dataset_provider, lr_sch...
function initialize_distributed (line 159) | def initialize_distributed(num_stages, mp_size, distributed_backend='ncc...
function initialize_megatron (line 162) | def initialize_megatron(extra_args_provider=None, args_defaults={}):
FILE: examples/gpt/pretrain_gpt2.py
function _build_index_mappings (line 26) | def _build_index_mappings(name, data_prefix, documents, sizes,
class GPT2DatasetFixed (line 130) | class GPT2DatasetFixed(torch.utils.data.Dataset):
method __init__ (line 131) | def __init__(self, name, data_prefix, documents, indexed_dataset,
method __len__ (line 146) | def __len__(self):
method __getitem__ (line 151) | def __getitem__(self, idx):
function build_train_valid_test_datasets (line 181) | def build_train_valid_test_datasets(data_prefix, data_impl, splits_string,
function model_provider (line 223) | def model_provider():
function lr_scheduler_builder (line 230) | def lr_scheduler_builder(optimizer):
function pretrain (line 257) | def pretrain(model_provider, args_defaults={}):
function traing_log (line 273) | def traing_log(loss_dict, iteration):
function train_valid_test_dataset_provider (line 301) | def train_valid_test_dataset_provider(train_val_test_num_samples):
function train (line 319) | def train(engine, optimizer, lr_scheduler):
FILE: src/veGiantModel/__init__.py
function initialize (line 17) | def initialize(args,
FILE: src/veGiantModel/distributed/__init__.py
function get_model_parallel_world_size (line 4) | def get_model_parallel_world_size():
function get_model_parallel_rank (line 7) | def get_model_parallel_rank():
function get_data_parallel_world_size (line 10) | def get_data_parallel_world_size():
function get_model_parallel_group (line 13) | def get_model_parallel_group():
function get_grid (line 16) | def get_grid():
function copy_to_model_parallel_region (line 19) | def copy_to_model_parallel_region(input_):
function reduce_from_model_parallel_region (line 22) | def reduce_from_model_parallel_region(input_):
function gather_from_model_parallel_region (line 25) | def gather_from_model_parallel_region(input_):
FILE: src/veGiantModel/engine/engine.py
function is_even (line 44) | def is_even(number):
function _tensor_bytes (line 57) | def _tensor_bytes(tensor):
function _dtype_to_code (line 60) | def _dtype_to_code(dtype):
function _code_to_dtype (line 76) | def _code_to_dtype(code):
class VeGiantModelEngine (line 92) | class VeGiantModelEngine(PipelineEngine):
method overwrite (line 98) | def overwrite(self, config_params, args):
method __init__ (line 117) | def __init__(self, args,
method _profiling_func_exit (line 315) | def _profiling_func_exit(self):
method _profiling_func_enter (line 318) | def _profiling_func_enter(self, func):
method _build_data_iter (line 321) | def _build_data_iter(self, dataset):
method _exec_reduce_tied_grads (line 335) | def _exec_reduce_tied_grads(self):
method _exec_reduce_grads (line 340) | def _exec_reduce_grads(self):
method _reserve_pipe_buffers (line 350) | def _reserve_pipe_buffers(self, num_buffers):
method train_batch (line 366) | def train_batch(self, data_iter=None):
method eval_batch (line 459) | def eval_batch(self, data_iter):
method is_first_stage (line 538) | def is_first_stage(self):
method is_last_stage (line 542) | def is_last_stage(self):
method _aggregate_metric (line 546) | def _aggregate_metric(self):
method _aggregate_total_loss (line 574) | def _aggregate_total_loss(self):
method set_dataloader (line 613) | def set_dataloader(self, loader):
method set_dataiterator (line 619) | def set_dataiterator(self, iterator):
method set_batch_fn (line 625) | def set_batch_fn(self, fn):
method is_gradient_accumulation_boundary (line 630) | def is_gradient_accumulation_boundary(self):
method tput_log (line 642) | def tput_log(self, *msg):
method _next_batch (line 646) | def _next_batch(self):
method _exec_bps_forward_pass (line 679) | def _exec_bps_forward_pass(self, buffer_id):
method _exec_bps_backward_pass (line 730) | def _exec_bps_backward_pass(self, buffer_id):
method _exec_load_micro_batch (line 787) | def _exec_load_micro_batch(self, buffer_id):
method _send_tensor_meta (line 838) | def _send_tensor_meta(self, buffer, recv_stage):
method _recv_tensor_meta (line 908) | def _recv_tensor_meta(self, send_stage):
method _mp_slice (line 971) | def _mp_slice(self, x):
method _mp_view (line 975) | def _mp_view(self, x, rank):
method _exec_bps_send_partitioned_activations (line 979) | def _exec_bps_send_partitioned_activations(self, buffer_id):
method _exec_bps_send_activations (line 1010) | def _exec_bps_send_activations(self, buffer_id):
method _exec_bps_send_grads (line 1042) | def _exec_bps_send_grads(self, buffer_id):
method _exec_bps_send_partitioned_grads (line 1084) | def _exec_bps_send_partitioned_grads(self, buffer_id):
method _exec_bps_sync_all (line 1126) | def _exec_bps_sync_all(self):
method _exec_bps_sync_partitioned_grads (line 1129) | def _exec_bps_sync_partitioned_grads(self, buffer_id):
method _exec_bps_sync_grads (line 1154) | def _exec_bps_sync_grads(self, buffer_id):
method _exec_bps_sync_partitioned_activations (line 1175) | def _exec_bps_sync_partitioned_activations(self, buffer_id):
method _exec_bps_sync_activations (line 1208) | def _exec_bps_sync_activations(self, buffer_id):
method _exec_bps_recv_partitioned_activations (line 1238) | def _exec_bps_recv_partitioned_activations(self, buffer_id):
method _exec_bps_recv_activations (line 1273) | def _exec_bps_recv_activations(self, buffer_id):
method _exec_bps_recv_partitioned_grads (line 1307) | def _exec_bps_recv_partitioned_grads(self, buffer_id):
method _exec_bps_recv_grads (line 1344) | def _exec_bps_recv_grads(self, buffer_id):
method _exec_optimizer_step (line 1380) | def _exec_optimizer_step(self, lr_kwargs=None):
method _zero_grads (line 1446) | def _zero_grads(self, inputs):
method _allocate_zeros (line 1455) | def _allocate_zeros(self, shape, fp16=None, **kwargs):
method _allocate_zeros2 (line 1475) | def _allocate_zeros2(self, shape, dtype, **kwargs):
method _allocate_buffer (line 1478) | def _allocate_buffer(self, shape, num_buffers=-1, **kwargs):
method _allocate_buffer2 (line 1486) | def _allocate_buffer2(self, shape, dtype, num_buffers=-1, **kwargs):
method _allocate_buffers (line 1494) | def _allocate_buffers(self, shapes, requires_grad=False, num_buffers=-1):
method _allocate_buffers2 (line 1505) | def _allocate_buffers2(self, shapes, dtypes, requires_grad=False, num_...
method forward (line 1516) | def forward(self, *args, **kwargs):
method backward (line 1520) | def backward(self, *args, **kwargs):
method step (line 1524) | def step(self, *args, **kwargs):
method _exec_schedule (line 1546) | def _exec_schedule(self, pipe_schedule):
FILE: src/veGiantModel/engine/module.py
class VeGiantModule (line 21) | class VeGiantModule(PipelineModule):
method __init__ (line 22) | def __init__(self,
method _build (line 134) | def _build(self):
method _count_layer_params (line 188) | def _count_layer_params(self):
method _find_layer_type (line 207) | def _find_layer_type(self, layername):
method forward (line 229) | def forward(self, forward_input):
method _partition_uniform (line 285) | def _partition_uniform(self, num_items, num_parts):
method _partition_balanced (line 298) | def _partition_balanced(self, weights, num_parts, eps=1e-3):
method _partition_layers (line 315) | def _partition_layers(self, method='uniform'):
method allreduce_tied_weight_gradients (line 379) | def allreduce_tied_weight_gradients(self):
method _synchronize_tied_weights (line 385) | def _synchronize_tied_weights(self):
method _index_tied_modules (line 394) | def _index_tied_modules(self):
method partitions (line 450) | def partitions(self):
method stage_owner (line 453) | def stage_owner(self, layer_idx):
method _set_bounds (line 460) | def _set_bounds(self, start=None, stop=None):
method set_checkpoint_interval (line 470) | def set_checkpoint_interval(self, interval):
method topology (line 474) | def topology(self):
method mpu (line 478) | def mpu(self):
method num_pipeline_stages (line 481) | def num_pipeline_stages(self):
method ckpt_prefix (line 484) | def ckpt_prefix(self, checkpoints_path, tag):
method ckpt_layer_path (line 500) | def ckpt_layer_path(self, ckpt_dir, local_layer_idx):
method save_state_dict (line 510) | def save_state_dict(self, save_dir):
method load_state_dir (line 522) | def load_state_dir(self, load_dir, strict=True):
method _is_checkpointable (line 543) | def _is_checkpointable(self, funcs):
FILE: src/veGiantModel/engine/p2p.py
function init_process_groups (line 40) | def init_process_groups(grid):
function _is_valid_send_recv (line 49) | def _is_valid_send_recv(src_stage, dest_stage):
function send (line 58) | def send(tensor, dest_stage, async_op=False):
function _bps_get_name (line 73) | def _bps_get_name(src, dest, name, suffix):
function bps_send (line 76) | def bps_send(tensor, dest_stage, name, index, async_op=True):
function bps_sync (line 99) | def bps_sync(src_stage, name, index=0):
function bps_sync_all (line 110) | def bps_sync_all():
function bps_recv (line 121) | def bps_recv(tensor, src_stage, name, index=0, async_op=True):
function _send (line 144) | def _send(tensor, src_rank, group, async_op):
function send_grads (line 148) | def send_grads(tensor, grid, async_op=False):
function _recv (line 158) | def _recv(tensor, src_rank, group, async_op):
function recv_grads (line 164) | def recv_grads(tensor, grid, async_op=False):
function send_activations (line 171) | def send_activations(tensor, grid, async_op=False):
function recv_activations (line 181) | def recv_activations(tensor, grid, async_op=False):
function recv (line 187) | def recv(tensor, src_stage, async_op=False):
function barrier (line 200) | def barrier(stage_id):
function _get_send_recv_group (line 211) | def _get_send_recv_group(src_stage, dest_stage):
FILE: src/veGiantModel/engine/schedule.py
class BytePSInferenceSchedule (line 12) | class BytePSInferenceSchedule(PipeSchedule):
method __init__ (line 15) | def __init__(self, micro_batches, stages, stage_id, prefetch=True):
method steps (line 19) | def steps(self):
method num_pipe_buffers (line 69) | def num_pipe_buffers(self):
class BytePSTrainSchedule (line 81) | class BytePSTrainSchedule(TrainSchedule):
method __init__ (line 88) | def __init__(self, micro_batches, stages, stage_id, prefetch=True):
method steps (line 94) | def steps(self):
method _steps (line 100) | def _steps(self):
method _steps_no_prefetch (line 184) | def _steps_no_prefetch(self):
method num_pipe_buffers (line 231) | def num_pipe_buffers(self):
class BytePSSendActivation (line 240) | class BytePSSendActivation(BufferOpInstruction):
class BytePSRecvActivation (line 243) | class BytePSRecvActivation(BufferOpInstruction):
class BytePSSyncActivation (line 246) | class BytePSSyncActivation(BufferOpInstruction):
class BytePSSyncGrad (line 249) | class BytePSSyncGrad(BufferOpInstruction):
class BytePSSendGrad (line 252) | class BytePSSendGrad(BufferOpInstruction):
class BytePSRecvGrad (line 255) | class BytePSRecvGrad(BufferOpInstruction):
class BytePSForwardPass (line 258) | class BytePSForwardPass(BufferOpInstruction):
class BytePSBackwardPass (line 261) | class BytePSBackwardPass(BufferOpInstruction):
class BytePSSyncAll (line 264) | class BytePSSyncAll(PipeInstruction):
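These schedules extend DeepSpeed's pipe schedule classes, so steps() is a generator that yields, for every pipeline clock cycle, the list of instructions (the BytePS* classes above) the engine should execute, and num_pipe_buffers() reports how many activation buffers that plan needs. A minimal sketch of inspecting a schedule; the micro-batch and stage counts are illustrative only:

from veGiantModel.engine.schedule import BytePSTrainSchedule

# Hypothetical sizes: 8 micro-batches over 4 pipeline stages, seen from stage 0.
sched = BytePSTrainSchedule(micro_batches=8, stages=4, stage_id=0, prefetch=True)

print('pipe buffers needed:', sched.num_pipe_buffers())

# Each yielded item is one clock cycle's worth of instructions, e.g. a
# BytePSRecvActivation followed by a BytePSForwardPass on an early cycle.
for cycle, cmds in enumerate(sched.steps()):
    print(cycle, cmds)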
FILE: src/veGiantModel/engine/topology.py
class ProcessTopology (line 14) | class ProcessTopology:
method __init__ (line 27) | def __init__(self, axes, dims):
method get_rank (line 50) | def get_rank(self, **coord_kwargs):
method get_axis_names (line 66) | def get_axis_names(self):
method get_rank_repr (line 70) | def get_rank_repr(self,
method get_dim (line 104) | def get_dim(self, axis):
method get_coord (line 116) | def get_coord(self, rank):
method get_axis_comm_lists (line 133) | def get_axis_comm_lists(self, axis):
method filter_match (line 173) | def filter_match(self, **filter_kwargs):
method get_axis_list (line 198) | def get_axis_list(self, axis, idx):
method world_size (line 215) | def world_size(self):
method __str__ (line 218) | def __str__(self):
function _prime_factors (line 222) | def _prime_factors(N):
class PipeDataParallelTopology (line 237) | class PipeDataParallelTopology(ProcessTopology):
method __init__ (line 244) | def __init__(self, num_pp, num_dp):
class PipeModelDataParallelTopology (line 248) | class PipeModelDataParallelTopology(ProcessTopology):
method __init__ (line 250) | def __init__(self, num_dp, num_pp, num_mp):
class PipelineParallelGrid (line 255) | class PipelineParallelGrid:
method __init__ (line 277) | def __init__(self, topology=None, process_group=None):
method get_stage_id (line 388) | def get_stage_id(self):
method get_data_parallel_id (line 391) | def get_data_parallel_id(self):
method get_model_parallel_id (line 394) | def get_model_parallel_id(self):
method get_src_parallel_src_id (line 399) | def get_src_parallel_src_id(self):
method _build_p2p_groups (line 406) | def _build_p2p_groups(self):
method _build_grads_groups (line 430) | def _build_grads_groups(self):
method _build_activation_groups (line 471) | def _build_activation_groups(self):
method _is_grid_valid (line 510) | def _is_grid_valid(self):
method stage_to_global (line 518) | def stage_to_global(self, stage_id, **kwargs):
method stage_to_byteps (line 524) | def stage_to_byteps(self, stage_id):
method topology (line 527) | def topology(self):
method get_global_rank (line 531) | def get_global_rank(self):
method get_pipe_parallel_rank (line 534) | def get_pipe_parallel_rank(self):
method get_pipe_parallel_world_size (line 538) | def get_pipe_parallel_world_size(self):
method get_pipe_parallel_group (line 542) | def get_pipe_parallel_group(self):
method get_data_parallel_rank (line 546) | def get_data_parallel_rank(self):
method get_data_parallel_world_size (line 550) | def get_data_parallel_world_size(self):
method get_data_parallel_group (line 554) | def get_data_parallel_group(self):
method get_model_parallel_rank (line 560) | def get_model_parallel_rank(self):
method get_model_parallel_world_size (line 563) | def get_model_parallel_world_size(self):
method get_model_parallel_group (line 566) | def get_model_parallel_group(self):
method get_slice_parallel_rank (line 570) | def get_slice_parallel_rank(self):
method get_slice_parallel_world_size (line 573) | def get_slice_parallel_world_size(self):
method get_slice_parallel_group (line 576) | def get_slice_parallel_group(self):
method get_slice_parallel_src_rank (line 579) | def get_slice_parallel_src_rank(self):
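ProcessTopology maps between a flat rank and a coordinate over named axes, and the two subclasses only fix the axis order for pipeline/data(/model) parallelism; PipelineParallelGrid then builds the actual communication groups on top. A minimal sketch of the rank/coordinate mapping, assuming the DeepSpeed-style semantics these classes are derived from:

from veGiantModel.engine.topology import ProcessTopology, PipeModelDataParallelTopology

# 2 pipeline stages x 4 data-parallel replicas = 8 ranks in total.
topo = ProcessTopology(axes=['pipe', 'data'], dims=[2, 4])

print(topo.get_rank(pipe=1, data=2))     # flat rank of that coordinate
print(topo.get_coord(rank=5))            # coordinate of flat rank 5
print(topo.get_axis_comm_lists('data'))  # rank groups that differ only along 'data'
print(topo.world_size())                 # 8

# 3-D variant used when model parallelism is added on top:
topo3d = PipeModelDataParallelTopology(num_dp=2, num_pp=2, num_mp=2)
print(topo3d.get_axis_names())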
FILE: src/veGiantModel/initialize.py
function add_byte_giant_model_customize_args (line 17) | def add_byte_giant_model_customize_args(parser):
function initialize_megatron (line 53) | def initialize_megatron(extra_args_provider=None, args_defaults={}):
function _init_topology (line 59) | def _init_topology(num_stages, mp_size):
function _set_random_seed (line 70) | def _set_random_seed(seed):
function init_distribute (line 81) | def init_distribute(num_stages, mp_size,
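initialize_megatron here mirrors the upstream Megatron-LM entry point (argument parsing, seeding, and topology setup via _init_topology), with add_byte_giant_model_customize_args injecting the veGiantModel-specific flags. A minimal calling sketch; the extra-args hook and the default value shown are illustrative, and init_distribute is not called directly because the index truncates its signature:

from veGiantModel.initialize import initialize_megatron

def extra_args(parser):
    # Hypothetical project-specific flag; mirrors Megatron-LM's extra_args_provider hook.
    parser.add_argument('--my-flag', action='store_true')
    return parser

initialize_megatron(extra_args_provider=extra_args,
                    args_defaults={'tokenizer_type': 'GPT2BPETokenizer'})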
FILE: src/veGiantModel/launcher/launch.py
class PropagatingThread (line 13) | class PropagatingThread(threading.Thread):
method run (line 18) | def run(self):
method join (line 31) | def join(self):
function launch_scheduler (line 37) | def launch_scheduler(local_rank):
function get_worker0_host (line 65) | def get_worker0_host():
function get_worker0_port (line 69) | def get_worker0_port():
function setup_env (line 73) | def setup_env(local_rank):
function launch_bps (line 114) | def launch_bps(local_rank):
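The launcher module is the process-bootstrap side of the BytePS backend: judging from the names indexed above, setup_env prepares the BytePS environment, launch_scheduler starts the rendezvous scheduler, and launch_bps ties both together per local rank. A minimal sketch of the single call a training script would make; the local rank value is illustrative and would normally come from the multi-process launcher:

from veGiantModel.launcher.launch import launch_bps

local_rank = 0      # hypothetical; usually provided by the process launcher
launch_bps(local_rank)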
FILE: src/veGiantModel/module/dense.py
class MockModule (line 25) | class MockModule(nn.Module):
class LinearFunction (line 32) | class LinearFunction(autograd.Function):
method forward (line 35) | def forward(ctx, input_tensor, weight, bias, act_gelu=False, dropout_r...
method backward (line 48) | def backward(ctx, grad_out):
class FTLinear (line 60) | class FTLinear(nn.Module):
method __init__ (line 61) | def __init__(self, in_features, out_features, initializer_range=0.02, ...
method forward (line 74) | def forward(self, input_tensor):
method extra_repr (line 77) | def extra_repr(self):
class LinearTransposeFunction (line 86) | class LinearTransposeFunction(autograd.Function):
method forward (line 88) | def forward(ctx, input_tensor, weight, bias, head_num, transpose_type):
method backward (line 96) | def backward(ctx, grad_out):
class FTLinearTranspose (line 101) | class FTLinearTranspose(nn.Module):
method __init__ (line 102) | def __init__(self, in_features, out_features, head_num, transpose_type...
method forward (line 115) | def forward(self, input_tensor):
method extra_repr (line 118) | def extra_repr(self):
function column_parallel_load_hook (line 125) | def column_parallel_load_hook(module, log_fn):
function column_serial_load_hook (line 165) | def column_serial_load_hook(module, log_fn):
class ColumnSerialLinear (line 211) | class ColumnSerialLinear(MockModule):
method __init__ (line 212) | def __init__(self, in_features, out_features, initializer_range=0.02,
method forward (line 243) | def forward(self, input_tensor):
method extra_repr (line 255) | def extra_repr(self):
class ColumnParallelLinear (line 258) | class ColumnParallelLinear(nn.Module):
method __init__ (line 259) | def __init__(self, in_features, out_features, initializer_range=0.02,
method forward (line 309) | def forward(self, input_tensor):
method extra_repr (line 321) | def extra_repr(self):
class RowSerialLinear (line 324) | class RowSerialLinear(MockModule):
method __init__ (line 325) | def __init__(self, in_features, out_features, initializer_range=0.02, ...
method forward (line 365) | def forward(self, input_tensor):
method extra_repr (line 381) | def extra_repr(self):
class RowParallelLinear (line 384) | class RowParallelLinear(nn.Module):
method __init__ (line 385) | def __init__(self, in_features, out_features, initializer_range=0.02, ...
method forward (line 438) | def forward(self, input_tensor):
method extra_repr (line 450) | def extra_repr(self):
class ColumnParallelLinearTranspose (line 454) | class ColumnParallelLinearTranspose(nn.Module):
method __init__ (line 455) | def __init__(self, in_features, out_features, head_num, transpose_type...
method forward (line 497) | def forward(self, input_tensor):
method extra_repr (line 511) | def extra_repr(self):
class ColumnSerialLinearTranspose (line 514) | class ColumnSerialLinearTranspose(MockModule):
method __init__ (line 515) | def __init__(self, in_features, out_features, head_num, transpose_type...
method forward (line 547) | def forward(self, input_tensor):
method extra_repr (line 562) | def extra_repr(self):
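dense.py provides Megatron-style tensor-parallel linear layers (the Column* variants shard the output dimension, the Row* variants shard the input dimension) plus *Serial fallbacks with the same interface for single-device runs. A minimal sketch of the usual column-then-row pairing in a transformer MLP, assuming model parallelism has already been initialized and using only the constructor arguments visible above; the chained call also assumes a plain nn.Linear-like forward contract, which the index does not confirm:

import torch

from veGiantModel.module.dense import ColumnParallelLinear, RowParallelLinear

hidden = 1024  # illustrative size

# Classic Megatron pairing: column-parallel up-projection, row-parallel down-projection.
h_to_4h = ColumnParallelLinear(hidden, 4 * hidden)
h4_to_h = RowParallelLinear(4 * hidden, hidden)

x = torch.randn(8, hidden, device='cuda')
y = h4_to_h(h_to_4h(x))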
FILE: src/veGiantModel/patcher.py
function is_unitialized (line 9) | def is_unitialized():
function initialize_model_parallel (line 14) | def initialize_model_parallel(grid):
function model_parallel_is_initialized (line 21) | def model_parallel_is_initialized():
function get_model_parallel_group (line 28) | def get_model_parallel_group():
function get_data_parallel_group (line 35) | def get_data_parallel_group():
function set_model_parallel_world_size (line 42) | def set_model_parallel_world_size(world_size):
function get_model_parallel_world_size (line 46) | def get_model_parallel_world_size():
function set_model_parallel_rank (line 51) | def set_model_parallel_rank(rank):
function get_model_parallel_rank (line 55) | def get_model_parallel_rank():
function get_model_parallel_src_rank (line 60) | def get_model_parallel_src_rank():
function get_data_parallel_world_size (line 64) | def get_data_parallel_world_size():
function get_data_parallel_rank (line 69) | def get_data_parallel_rank():
function get_pipe_parallel_rank (line 73) | def get_pipe_parallel_rank():
function destroy_model_parallel (line 76) | def destroy_model_parallel():
function get_grid (line 81) | def get_grid():
function get_topo (line 84) | def get_topo():
function _gather (line 113) | def _gather(input_):
function build_tokenizer (line 140) | def build_tokenizer(args):
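patcher.py re-implements the megatron.mpu process-group query API on top of a veGiantModel grid, so Megatron code keeps working once initialize_model_parallel(grid) has been called during setup. A minimal sketch of the query side, assuming initialization has already happened elsewhere (e.g. via initialize.py):

from veGiantModel import patcher

if patcher.model_parallel_is_initialized():
    print('mp rank:', patcher.get_model_parallel_rank(),
          'of', patcher.get_model_parallel_world_size())
    print('dp rank:', patcher.get_data_parallel_rank(),
          'of', patcher.get_data_parallel_world_size())
    print('pp rank:', patcher.get_pipe_parallel_rank())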