[
  {
    "path": "ddp_tutorial.md",
    "content": "# Distributed data parallel training in Pytorch\n\n## Motivation\n\nThe easiest way to speed up neural network training is to use a GPU, which provides large speedups over CPUs on the types of calculations (matrix multiplies and additions) that are common in neural networks. As the model or dataset gets bigger, one GPU quickly becomes insufficient. For example, big language models such as [BERT](https://arxiv.org/abs/1810.04805) and [GPT-2](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) are trained on hundreds of GPUs. To multi-GPU training, we must have a way to split the model and data between different GPUs and to coordinate the training. \n\n\n### Why distributed data parallel?\n\nI like to implement my models in Pytorch because I find it has the best balance between control and ease of use of the major neural-net frameworks. Pytorch has two ways to split models and data across multiple GPUs: [`nn.DataParallel`](https://pytorch.org/docs/stable/nn.html#dataparallel) and [`nn.DistributedDataParallel`](https://pytorch.org/docs/stable/nn.html#distributeddataparallel). `nn.DataParallel` is easier to use (just wrap the model and run your training script). However, because it uses one process to compute the model weights and then distribute them to each GPU during each batch, networking quickly becomes a bottle-neck and GPU utilization is often very low. Furthermore, `nn.DataParallel` requires that all the GPUs be on the same node and doesn't work with [Apex](https://nvidia.github.io/apex/amp.html) for [mixed-precision](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) training. \n\n### The existing documentation is insufficient\n\nIn general, the Pytorch documentation is thorough and clear, especially in version 1.0.x. 
I taught myself Pytorch almost entirely from the documentation and tutorials: this is definitely much more a reflection of Pytorch's ease of use and excellent documentation than it is any special ability on my part. So I was very surprised when I spent some time trying to figure out how to use `DistributedDataParallel` and found all of the examples and tutorials to be some combination of inaccessible, incomplete, or overloaded with irrelevant features. \n\nPytorch provides a [tutorial](https://pytorch.org/tutorials/beginner/aws_distributed_training_tutorial.html) on distributed training using AWS, which does a pretty good job of showing you how to set things up on the AWS side. However, the rest of it is a bit messy, as it spends a lot of time showing how to calculate metrics for some reason before going back to showing how to wrap your model and launch the processes. It also doesn't describe what `nn.DistributedDataParallel` does, which makes the relevant code blocks difficult to follow. \n\nThe [tutorial](https://pytorch.org/tutorials/intermediate/dist_tuto.html) on writing distributed applications in Pytorch has much more detail than necessary for a first pass and is not accessible to somebody without a strong background in multiprocessing in Python. It spends a lot of time replicating the functionality in `nn.DistributedDataParallel`. However, it doesn't give a high-level overview of what it does and provides no insight into how to *use* it. \n\nThere's also a Pytorch [tutorial](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) on getting started with distributed data parallel. This one shows how to do some setup, but doesn't explain what the setup is for, and then shows some code to split a model across GPUs and do one optimization step. 
Unfortunately, I'm pretty sure the code as written won't run (the function names don't match up) and furthermore it doesn't tell you *how* to run the code. Like the previous tutorial, it also doesn't give a high-level overview of how distributed training works. \n\nThe closest thing to a minimum working example that Pytorch provides is the [Imagenet](https://github.com/pytorch/examples/tree/master/imagenet) training example. Unfortunately, that example also demonstrates pretty much every other feature Pytorch has, so it's difficult to pick out what pertains to distributed, multi-GPU training. \n\nApex provides their own [version](https://github.com/NVIDIA/apex/tree/master/examples/imagenet) of the Pytorch Imagenet example. The documentation there tells you that their version of `nn.DistributedDataParallel` is a drop-in replacement for Pytorch's, which is only helpful after learning how to use Pytorch's. \n\nThis [tutorial](http://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/) has a good description of what's going on under the hood and how it's different from `nn.DataParallel`. However, it doesn't have code examples of how to use `nn.DistributedDataParallel`.\n\n## Outline\n\nThis tutorial is really directed at people who are already familiar with training neural network models in Pytorch, and I won't go over any of those parts of the code. I'll begin by summarizing the big picture. I then show a minimum working example of training on MNIST using one GPU. I modify this example to train on multiple GPUs, possibly across multiple nodes, and explain the changes line by line. Importantly, I also explain how to run the code. As a bonus, I also demonstrate how to use Apex to do easy mixed-precision distributed training. \n\n## The big picture\n\nMultiprocessing with `DistributedDataParallel` duplicates the model across multiple GPUs, each of which is controlled by one process. 
(If you want, you can have each process control multiple GPUs, but that should obviously be slower than having one GPU per process. It's also possible to have multiple worker processes that fetch data for each GPU, but I'm going to leave that out for the sake of simplicity.) The GPUs can all be on the same node or spread across multiple nodes. Every process does identical tasks, and each process communicates with all the others. Only gradients are passed between the processes/GPUs so that network communication is less of a bottleneck. \n\n![figure](graphics/processes-gpus.png)\n\nDuring training, each process loads its own minibatches from disk and passes them to its GPU. Each GPU does its own forward pass, and then the gradients are all-reduced across the GPUs. The gradients for a layer do not depend on the layers before it (those closer to the input), so the gradient all-reduce is calculated concurrently with the backwards pass to further alleviate the networking bottleneck. At the end of the backwards pass, every process has the averaged gradients, ensuring that the model weights stay synchronized. \n\nAll this requires that the multiple processes, possibly on multiple nodes, are synchronized and communicate. Pytorch does this through its [`distributed.init_process_group`](https://pytorch.org/docs/stable/distributed.html#initialization) function. This function needs to know where to find process 0 so that all the processes can sync up, and the total number of processes to expect. Each individual process also needs to know the total number of processes as well as its rank within the processes and which GPU to use. It's common to call the total number of processes the *world size*. Finally, each process needs to know which slice of the data to work on so that the batches are non-overlapping. Pytorch provides [`torch.utils.data.distributed.DistributedSampler`](https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html) to accomplish this. 
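\n\nTo make the averaging step concrete, here is a short sketch in symbols. The notation is my own shorthand rather than anything from the Pytorch documentation: $W$ is the world size, $\\theta_t$ are the model weights after step $t$, $g_t^{(r)}$ is the gradient that process $r$ computes on its own minibatch, and $\\eta$ is the learning rate (plain SGD is assumed here). \n\n$$\\bar{g}_t = \\frac{1}{W} \\sum_{r=0}^{W-1} g_t^{(r)}, \\qquad \\theta_{t+1} = \\theta_t - \\eta \\, \\bar{g}_t$$\n\nEvery process starts the step with the same $\\theta_t$ and applies the same $\\bar{g}_t$, so every process ends the step with the same $\\theta_{t+1}$, and only gradients need to be exchanged during training. 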
\n\n## Minimum working examples with explanations\n\nTo demonstrate how to do this, I'll create an example that [trains on MNIST](https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist.py), and then modify it to run on [multiple GPUs across multiple nodes](https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-distributed.py), and finally to also allow [mixed-precision training](https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py). \n\n### Without multiprocessing\n\nFirst, we import everything we need. \n\n```python {.line-numbers}\nimport os\nfrom datetime import datetime\nimport argparse\nimport torch.multiprocessing as mp\nimport torchvision\nimport torchvision.transforms as transforms\nimport torch\nimport torch.nn as nn\nimport torch.distributed as dist\nfrom apex.parallel import DistributedDataParallel as DDP\nfrom apex import amp\n```\n\nWe define a very simple convolutional model for predicting MNIST. \n\n```python\nclass ConvNet(nn.Module):\n    def __init__(self, num_classes=10):\n        super(ConvNet, self).__init__()\n        self.layer1 = nn.Sequential(\n            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(16),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.layer2 = nn.Sequential(\n            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(32),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.fc = nn.Linear(7*7*32, num_classes)\n\n    def forward(self, x):\n        out = self.layer1(x)\n        out = self.layer2(out)\n        out = out.reshape(out.size(0), -1)\n        out = self.fc(out)\n        return out\n```\n\nThe `main()` function will take in some arguments and run the training function. 
\n\n```python\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')\n    parser.add_argument('-g', '--gpus', default=1, type=int,\n                        help='number of gpus per node')\n    parser.add_argument('-nr', '--nr', default=0, type=int,\n                        help='ranking within the nodes')\n    parser.add_argument('--epochs', default=2, type=int, metavar='N',\n                        help='number of total epochs to run')\n    args = parser.parse_args()\n    train(0, args)\n```\n\nAnd here's the train function. \n\n```python\ndef train(gpu, args):\n    model = ConvNet()\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n    batch_size = 100\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(gpu)\n    optimizer = torch.optim.SGD(model.parameters(), 1e-4)\n    # Data loading code\n    train_dataset = torchvision.datasets.MNIST(root='./data',\n                                               train=True,\n                                               transform=transforms.ToTensor(),\n                                               download=True)\n    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,\n                                               batch_size=batch_size,\n                                               shuffle=True,\n                                               num_workers=0,\n                                               pin_memory=True)\n\n    start = datetime.now()\n    total_step = len(train_loader)\n    for epoch in range(args.epochs):\n        for i, (images, labels) in enumerate(train_loader):\n            images = images.cuda(non_blocking=True)\n            labels = labels.cuda(non_blocking=True)\n            # Forward pass\n            outputs = model(images)\n            loss = criterion(outputs, labels)\n\n            # Backward and optimize\n            optimizer.zero_grad()\n            
loss.backward()\n            optimizer.step()\n            if (i + 1) % 100 == 0 and gpu == 0:\n                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,\n                                                                         loss.item()))\n    if gpu == 0:\n        print(\"Training complete in: \" + str(datetime.now() - start))\n```\n\nFinally, we want to make sure the `main()` function gets called. \n\n```python\nif __name__ == '__main__':\n    main()\n```\n\nThere's definitely some extra stuff in here (the number of gpus and nodes, for example) that we don't need yet, but it's helpful to put the whole skeleton in place. \n\nWe can run this code by opening a terminal and typing `python src/mnist.py -n 1 -g 1 -nr 0`, which will train on a single gpu on a single node. \n\n### With multiprocessing\n\nTo do this with multiprocessing, we need a script that will launch a process for every GPU. Each process needs to know which GPU to use, and where it ranks amongst all the processes that are running. We'll need to run the script on each node. \n\nLet's take a look at the changes to each function. I've fenced off the new code to make it easy to find. 
\n\n```python\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N')\n    parser.add_argument('-g', '--gpus', default=1, type=int,\n                        help='number of gpus per node')\n    parser.add_argument('-nr', '--nr', default=0, type=int,\n                        help='ranking within the nodes')\n    parser.add_argument('--epochs', default=2, type=int, metavar='N',\n                        help='number of total epochs to run')\n    args = parser.parse_args()\n    #########################################################\n    args.world_size = args.gpus * args.nodes                #\n    os.environ['MASTER_ADDR'] = '10.57.23.164'              #\n    os.environ['MASTER_PORT'] = '8888'                      #\n    mp.spawn(train, nprocs=args.gpus, args=(args,))         #\n    #########################################################\n```\n\nI hand-waved over the arguments in the last section, but now we actually need them. \n\n- `args.nodes` is the total number of nodes we're going to use. \n- `args.gpus` is the number of gpus on each node. \n- `args.nr` is the rank of the current node within all the nodes, and goes from 0 to `args.nodes` - 1. \n\nNow, let's go through the new changes line by line: \n\nLine 12: Based on the number of nodes and gpus per node, we can calculate the `world_size`, or the total number of processes to run, which is equal to the total number of gpus because we're assigning one gpu to every process. \n\nLine 13: This tells the multiprocessing module what IP address to look at for process 0. It needs this so that all the processes can sync up initially. \n\nLine 14: Likewise, this is the port to use when looking for process 0. \n\nLine 15: Now, instead of running the train function once, we will spawn `args.gpus` processes, each of which runs `train(i, args)`, where `i` goes from 0 to `args.gpus` - 1. 
Remember, we run the `main()` function on each node, so that in total there will be `args.nodes` * `args.gpus` = `args.world_size` processes. \n\nInstead of lines 13 and 14, I could have run `export MASTER_ADDR=10.57.23.164` and `export MASTER_PORT=8888` in the terminal. \n\nNext, let's look at the modifications to `train`. I'll fence the new lines again. \n\n```python\ndef train(gpu, args):\n    ######################################################################\n    rank = args.nr * args.gpus + gpu\n    dist.init_process_group(\n        backend='nccl',\n        init_method='env://',\n        world_size=args.world_size,\n        rank=rank\n    )\n    ######################################################################\n\n    model = ConvNet()\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n    batch_size = 100\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(gpu)\n    optimizer = torch.optim.SGD(model.parameters(), 1e-4)\n\n    ######################################################################\n    # Wrap the model\n    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])\n    ######################################################################\n\n    # Data loading code\n    train_dataset = torchvision.datasets.MNIST(root='./data',\n                                               train=True,\n                                               transform=transforms.ToTensor(),\n                                               download=True)\n\n    ######################################################################\n    train_sampler = 
torch.utils.data.distributed.DistributedSampler(\n        train_dataset,\n        num_replicas=args.world_size,\n        rank=rank\n    )\n    ######################################################################\n\n    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,\n                                               batch_size=batch_size,\n    ######################################################################\n                                               shuffle=False,            #\n    ######################################################################\n                                               num_workers=0,\n                                               pin_memory=True,\n    ######################################################################\n                                               sampler=train_sampler)    #\n    ######################################################################\n\n    start = datetime.now()\n    total_step = len(train_loader)\n    for epoch in range(args.epochs):\n        for i, (images, labels) in enumerate(train_loader):\n            images = images.cuda(non_blocking=True)\n            labels = labels.cuda(non_blocking=True)\n            # Forward pass\n            outputs = model(images)\n            loss = criterion(outputs, labels)\n\n            # Backward and optimize\n            optimizer.zero_grad()\n            loss.backward()\n            optimizer.step()\n            if (i + 1) % 100 == 0 and gpu == 0:\n                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,\n                                                                         loss.item()))\n    if gpu == 0:\n        print(\"Training complete in: \" + str(datetime.now() - start))\n```\n\nLine 3: This is the global rank of the process within all of the processes (one process per GPU). We'll use this for line 8. 
\n\nLines 4-9: Initialize the process and join up with the other processes. This is \"blocking,\" meaning that no process will continue until all processes have joined. I'm using the `nccl` backend here because the [pytorch docs](https://pytorch.org/docs/stable/distributed.html) say it's the fastest of the available ones. The `init_method` tells the process group where to look for some settings. In this case, it's looking at environment variables for the `MASTER_ADDR` and `MASTER_PORT`, which we set within `main`. I could have set the world size there too, as `WORLD_SIZE`, but I'm choosing to set it here as a keyword argument, along with the global rank of the current process. \n\nLine 22: Wrap the model as a [`DistributedDataParallel`](https://pytorch.org/docs/stable/nn.html#distributeddataparallel) model. This reproduces the model onto the GPU for the process. \n\nLines 32-36: The [`torch.utils.data.distributed.DistributedSampler`](https://pytorch.org/docs/stable/_modules/torch/utils/data/distributed.html) makes sure that each process gets a different slice of the training data. \n\nLines 42 and 47: Use the `DistributedSampler` instead of shuffling the usual way. \n\nTo run this on, say, 4 nodes with 8 GPUs each, we need 4 terminals (one on each node). On node 0 (as set by line 13 in `main`): \n\n```python src/mnist-distributed.py -n 4 -g 8 -nr 0```\n\nThen, on the other nodes: \n\n```python src/mnist-distributed.py -n 4 -g 8 -nr i```\n\nfor $i \\in \\{1, 2, 3\\}$. In other words, we run this script on each node, telling it to launch `args.gpus` processes that sync with each other before training begins. \n\nNote that the effective batch size is now the per-GPU batch size (the value in the script) times the total number of GPUs (the world size). 
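\n\nAs a quick check on the numbers for the 4-node, 8-GPU example above (the symbols $B$, $N$ and $G$ are just shorthand for the per-GPU batch size, `args.nodes` and `args.gpus`): \n\n$$B_{\\text{eff}} = B \\times N \\times G = 100 \\times 4 \\times 8 = 3200$$\n\nThe same example gives a world size of $4 \\times 8 = 32$, and the rank formula from line 3 of `train` puts the process for GPU 5 on node 2 at global rank $2 \\times 8 + 5 = 21$. 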
\n\n\n### With Apex for mixed precision\n\nMixed precision training (training in a combination of float (FP32) and half (FP16) precision) allows us to use larger batch sizes and take advantage of NVIDIA [Tensor Cores](https://www.nvidia.com/en-us/data-center/tensorcore/) for faster computation. AWS [p3](https://aws.amazon.com/ec2/instance-types/p3/) instances use NVIDIA Tesla V100 GPUs with Tensor Cores. We only need to change the `train` function. For the sake of concision, I've taken out the data loading code and the code after the backwards pass from the example here, replacing them with `...`, but they are still in the [full script](https://github.com/yangkky/distributed_tutorial/blob/master/src/mnist-mixed.py). \n\n```python\ndef train(gpu, args):\n    rank = args.nr * args.gpus + gpu\n    dist.init_process_group(\n        backend='nccl',\n        init_method='env://',\n        world_size=args.world_size,\n        rank=rank)\n\n    model = ConvNet()\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n    batch_size = 100\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(gpu)\n    optimizer = torch.optim.SGD(model.parameters(), 1e-4)\n    # Wrap the model\n    ######################################################################\n    model, optimizer = amp.initialize(model, optimizer, opt_level='O2')\n    model = DDP(model)\n    ######################################################################\n    # Data loading code\n\t...\n    start = datetime.now()\n    total_step = len(train_loader)\n    for epoch in range(args.epochs):\n        for i, (images, labels) in enumerate(train_loader):\n            images = images.cuda(non_blocking=True)\n            labels = labels.cuda(non_blocking=True)\n            # Forward pass\n            outputs = model(images)\n            loss = criterion(outputs, labels)\n\n            # Backward and optimize\n            optimizer.zero_grad()\n    
######################################################################\n            with amp.scale_loss(loss, optimizer) as scaled_loss:\n                scaled_loss.backward()\n    ######################################################################\n            optimizer.step()\n     ...\n```\n\nLine 18: [`amp.initialize`](https://nvidia.github.io/apex/amp.html#unified-api) wraps the model and optimizer for mixed precision training. Note that the model must already be on the correct GPU before calling `amp.initialize`. The `opt_level` goes from `O0`, which uses all floats, through `O3`, which uses half-precision throughout. `O1` and `O2` are different degrees of mixed-precision, the details of which can be found in the Apex [documentation](https://nvidia.github.io/apex/amp.html#opt-levels-and-properties). Yes, the first character in all those codes is a capital letter 'O', while the second character is a number. Yes, if you use a zero instead, you will get a baffling error message. \n\nLine 19: [`apex.parallel.DistributedDataParallel`](https://nvidia.github.io/apex/parallel.html) is a drop-in replacement for `nn.DistributedDataParallel`. We no longer have to specify the GPUs because Apex only allows one GPU per process. It also assumes that the script calls `torch.cuda.set_device(local_rank)` (line 10) before moving the model to GPU. \n\nLines 36-37: Mixed-precision training requires that the loss is [scaled](https://devblogs.nvidia.com/mixed-precision-training-deep-neural-networks/) in order to prevent the gradients from underflowing. Apex does this automatically. \n\nThis script is run the same way as the distributed training script. \n\n## Acknowledgments\n\nMany thanks to the computational team at VL56 for all your work on various parts of this. 
I'd like to especially thank Stephen Kottman, who got an MWE up while I was still trying to figure out how multiprocessing in Python works, and then explained it to me, and Andy Beam, who greatly improved the first draft of this tutorial. "
  },
  {
    "path": "src/mnist-distributed.py",
    "content": "import os\nfrom datetime import datetime\nimport argparse\nimport torch.multiprocessing as mp\nimport torchvision\nimport torchvision.transforms as transforms\nimport torch\nimport torch.nn as nn\nimport torch.distributed as dist\nfrom apex.parallel import DistributedDataParallel as DDP\nfrom apex import amp\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',\n                        help='number of data loading workers (default: 4)')\n    parser.add_argument('-g', '--gpus', default=1, type=int,\n                        help='number of gpus per node')\n    parser.add_argument('-nr', '--nr', default=0, type=int,\n                        help='ranking within the nodes')\n    parser.add_argument('--epochs', default=2, type=int, metavar='N',\n                        help='number of total epochs to run')\n    args = parser.parse_args()\n    args.world_size = args.gpus * args.nodes\n    os.environ['MASTER_ADDR'] = '10.57.23.164'\n    os.environ['MASTER_PORT'] = '8888'\n    mp.spawn(train, nprocs=args.gpus, args=(args,))\n\n\nclass ConvNet(nn.Module):\n    def __init__(self, num_classes=10):\n        super(ConvNet, self).__init__()\n        self.layer1 = nn.Sequential(\n            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(16),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.layer2 = nn.Sequential(\n            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(32),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.fc = nn.Linear(7*7*32, num_classes)\n\n    def forward(self, x):\n        out = self.layer1(x)\n        out = self.layer2(out)\n        out = out.reshape(out.size(0), -1)\n        out = self.fc(out)\n        return out\n\n\ndef train(gpu, args):\n    rank = args.nr * args.gpus + gpu\n    
dist.init_process_group(backend='nccl', init_method='env://', world_size=args.world_size, rank=rank)\n    torch.manual_seed(0)\n    model = ConvNet()\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n    batch_size = 100\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(gpu)\n    optimizer = torch.optim.SGD(model.parameters(), 1e-4)\n    # Wrap the model\n    model = nn.parallel.DistributedDataParallel(model, device_ids=[gpu])\n    # Data loading code\n    train_dataset = torchvision.datasets.MNIST(root='./data',\n                                               train=True,\n                                               transform=transforms.ToTensor(),\n                                               download=True)\n    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset,\n                                                                    num_replicas=args.world_size,\n                                                                    rank=rank)\n    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,\n                                               batch_size=batch_size,\n                                               shuffle=False,\n                                               num_workers=0,\n                                               pin_memory=True,\n                                               sampler=train_sampler)\n\n    start = datetime.now()\n    total_step = len(train_loader)\n    for epoch in range(args.epochs):\n        for i, (images, labels) in enumerate(train_loader):\n            images = images.cuda(non_blocking=True)\n            labels = labels.cuda(non_blocking=True)\n            # Forward pass\n            outputs = model(images)\n            loss = criterion(outputs, labels)\n\n            # Backward and optimize\n            optimizer.zero_grad()\n            loss.backward()\n            optimizer.step()\n            if (i + 1) % 100 == 0 and 
gpu == 0:\n                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,\n                                                                         loss.item()))\n    if gpu == 0:\n        print(\"Training complete in: \" + str(datetime.now() - start))\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "src/mnist-mixed.py",
    "content": "import os\nfrom datetime import datetime\nimport argparse\nimport torch.multiprocessing as mp\nimport torchvision\nimport torchvision.transforms as transforms\nimport torch\nimport torch.nn as nn\nimport torch.distributed as dist\nfrom apex.parallel import DistributedDataParallel as DDP\nfrom apex import amp\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',\n                        help='number of data loading workers (default: 4)')\n    parser.add_argument('-g', '--gpus', default=1, type=int,\n                        help='number of gpus per node')\n    parser.add_argument('-nr', '--nr', default=0, type=int,\n                        help='ranking within the nodes')\n    parser.add_argument('--epochs', default=2, type=int, metavar='N',\n                        help='number of total epochs to run')\n    args = parser.parse_args()\n    args.world_size = args.gpus * args.nodes\n    os.environ['MASTER_ADDR'] = 'localhost'\n    os.environ['MASTER_PORT'] = '8888'\n    mp.spawn(train, nprocs=args.gpus, args=(args,))\n\n\nclass ConvNet(nn.Module):\n    def __init__(self, num_classes=10):\n        super(ConvNet, self).__init__()\n        self.layer1 = nn.Sequential(\n            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(16),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.layer2 = nn.Sequential(\n            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(32),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.fc = nn.Linear(7*7*32, num_classes)\n\n    def forward(self, x):\n        out = self.layer1(x)\n        out = self.layer2(out)\n        out = out.reshape(out.size(0), -1)\n        out = self.fc(out)\n        return out\n\n\ndef train(gpu, args):\n    rank = args.nr * args.gpus + gpu\n    dist.init_process_group(\n     
   backend='nccl',\n        init_method='env://',\n        world_size=args.world_size,\n        rank=rank)\n    torch.manual_seed(0)\n    model = ConvNet()\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n    batch_size = 100\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(gpu)\n    optimizer = torch.optim.SGD(model.parameters(), 1e-4)\n    # Wrap the model\n    model, optimizer = amp.initialize(model, optimizer, opt_level='O2')\n    model = DDP(model)\n    # Data loading code\n    train_dataset = torchvision.datasets.MNIST(\n        root='./data',\n        train=True,\n        transform=transforms.ToTensor(),\n        download=True\n    )\n    train_sampler = torch.utils.data.distributed.DistributedSampler(\n        train_dataset,\n        num_replicas=args.world_size,\n        rank=rank)\n    train_loader = torch.utils.data.DataLoader(\n        dataset=train_dataset,\n        batch_size=batch_size,\n        shuffle=False,\n        num_workers=0,\n        pin_memory=True,\n        sampler=train_sampler\n    )\n\n    start = datetime.now()\n    total_step = len(train_loader)\n    for epoch in range(args.epochs):\n        for i, (images, labels) in enumerate(train_loader):\n            images = images.cuda(non_blocking=True)\n            labels = labels.cuda(non_blocking=True)\n            # Forward pass\n            outputs = model(images)\n            loss = criterion(outputs, labels)\n\n            # Backward and optimize\n            optimizer.zero_grad()\n            with amp.scale_loss(loss, optimizer) as scaled_loss:\n                scaled_loss.backward()\n            optimizer.step()\n            if (i + 1) % 100 == 0 and gpu == 0:\n                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(\n                    epoch + 1,\n                    args.epochs,\n                    i + 1,\n                    total_step,\n                    loss.item())\n                )\n    if gpu == 0:\n   
     print(\"Training complete in: \" + str(datetime.now() - start))\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "src/mnist.py",
    "content": "import os\nfrom datetime import datetime\nimport argparse\nimport torch.multiprocessing as mp\nimport torchvision\nimport torchvision.transforms as transforms\nimport torch\nimport torch.nn as nn\nimport torch.distributed as dist\nfrom apex.parallel import DistributedDataParallel as DDP\nfrom apex import amp\n\n\ndef main():\n    parser = argparse.ArgumentParser()\n    parser.add_argument('-n', '--nodes', default=1, type=int, metavar='N',\n                        help='number of data loading workers (default: 4)')\n    parser.add_argument('-g', '--gpus', default=1, type=int,\n                        help='number of gpus per node')\n    parser.add_argument('-nr', '--nr', default=0, type  =int,\n                        help='ranking within the nodes')\n    parser.add_argument('--epochs', default=2, type=int, metavar='N',\n                        help='number of total epochs to run')\n    args = parser.parse_args()\n    train(0, args)\n\n\nclass ConvNet(nn.Module):\n    def __init__(self, num_classes=10):\n        super(ConvNet, self).__init__()\n        self.layer1 = nn.Sequential(\n            nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(16),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.layer2 = nn.Sequential(\n            nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),\n            nn.BatchNorm2d(32),\n            nn.ReLU(),\n            nn.MaxPool2d(kernel_size=2, stride=2))\n        self.fc = nn.Linear(7*7*32, num_classes)\n\n    def forward(self, x):\n        out = self.layer1(x)\n        out = self.layer2(out)\n        out = out.reshape(out.size(0), -1)\n        out = self.fc(out)\n        return out\n\n\ndef train(gpu, args):\n    model = ConvNet()\n    torch.cuda.set_device(gpu)\n    model.cuda(gpu)\n    batch_size = 100\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(gpu)\n    optimizer = 
torch.optim.SGD(model.parameters(), 1e-4)\n    # Data loading code\n    train_dataset = torchvision.datasets.MNIST(root='./data',\n                                               train=True,\n                                               transform=transforms.ToTensor(),\n                                               download=True)\n    train_loader = torch.utils.data.DataLoader(dataset=train_dataset,\n                                               batch_size=batch_size,\n                                               shuffle=True,\n                                               num_workers=0,\n                                               pin_memory=True)\n\n    start = datetime.now()\n    total_step = len(train_loader)\n    for epoch in range(args.epochs):\n        for i, (images, labels) in enumerate(train_loader):\n            images = images.cuda(non_blocking=True)\n            labels = labels.cuda(non_blocking=True)\n            # Forward pass\n            outputs = model(images)\n            loss = criterion(outputs, labels)\n\n            # Backward and optimize\n            optimizer.zero_grad()\n            loss.backward()\n            optimizer.step()\n            if (i + 1) % 100 == 0 and gpu == 0:\n                print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch + 1, args.epochs, i + 1, total_step,\n                                                                         loss.item()))\n    if gpu == 0:\n        print(\"Training complete in: \" + str(datetime.now() - start))\n\n\nif __name__ == '__main__':\n    main()"
  }
]