Repository: eminorhan/baby-vision
Branch: master
Commit: f0340acd873d
Files: 32
Total size: 130.0 KB

Directory structure:
gitextract_uh48hnz9/

├── .gitignore
├── LICENSE
├── README.md
├── feature_animation.py
├── feature_animation_class.py
├── highly_activating_imgs.py
├── hog_baseline.py
├── imagenet_finetuning.py
├── linear_combination_maps.py
├── linear_decoding.py
├── moco/
│   ├── __init__.py
│   ├── builder.py
│   └── loader.py
├── moco_img.py
├── moco_temp.py
├── moco_utils.py
├── read_saycam.py
├── scripts/
│   ├── feature_animation.sh
│   ├── feature_animation_class.sh
│   ├── highly_activating_imgs.sh
│   ├── hog_baseline.sh
│   ├── imagenet_finetuning.sh
│   ├── linear_combination_maps.sh
│   ├── linear_decoding.sh
│   ├── moco_img.sh
│   ├── moco_temp.sh
│   ├── read_saycam.sh
│   ├── selectivities.sh
│   └── temporal_classification.sh
├── selectivities.py
├── temporal_classification.py
└── utils.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# File extensions
*.out
*.mp4

# Directories
/__pychache__
/moco/__pychache__


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2021 Emin Orhan

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# Self-supervised learning through the eyes of a child

This repository contains code for reproducing the results reported in the following paper:

Orhan AE, Gupta VV, Lake BM (2020) [Self-supervised learning through the eyes of a child.](https://arxiv.org/abs/2007.16189) *Advances in Neural Information Processing Systems 34 (NeurIPS 2020)*.

## Requirements

* pytorch == 1.5.1
* torchvision == 0.6.1

Slightly older or newer versions will probably work fine as well.

## Datasets

This project uses the SAYCam dataset described in the following paper: 

Sullivan J, Mei M, Perfors A, Wojcik EH, Frank MC (2020) [SAYCam: A large, longitudinal audiovisual dataset recorded from the infant’s perspective.](https://psyarxiv.com/fy8zx/) PsyArXiv.

The dataset is hosted on the [Databrary](https://nyu.databrary.org/) repository for behavioral science. Unfortunately, we are unable to publicly share the SAYCam dataset here due to the terms of use. However, interested researchers can apply for access to the dataset with approval from their institution's IRB. 

In addition, this project also uses the Toybox dataset for evaluation purposes. The Toybox dataset is publicly available at [this address](https://aivaslab.github.io/toybox/).

## Code description

* [`temporal_classification.py`](https://github.com/eminorhan/baby-vision/blob/master/temporal_classification.py): trains temporal classification models as described in the paper. This file uses code recycled from the PyTorch ImageNet training [example](https://github.com/pytorch/examples/tree/master/imagenet).
* [`read_saycam.py`](https://github.com/eminorhan/baby-vision/blob/master/read_saycam.py): SAYCam video-to-image reader.
* [`moco`](https://github.com/eminorhan/baby-vision/tree/master/moco) directory contains helper files for training static and temporal MoCo models. The code here was modified from [Facebook's MoCo repository](https://github.com/facebookresearch/moco).
* [`moco_img.py`](https://github.com/eminorhan/baby-vision/blob/master/moco_img.py): trains an image-based MoCo model as described in the paper. This code was modified from [Facebook's MoCo repository](https://github.com/facebookresearch/moco).
* [`moco_temp.py`](https://github.com/eminorhan/baby-vision/blob/master/moco_temp.py): trains a temporal MoCo model as described in the paper. This code was also modified from [Facebook's MoCo repository](https://github.com/facebookresearch/moco).
* [`moco_utils.py`](https://github.com/eminorhan/baby-vision/blob/master/moco_utils.py): some utility functions for MoCo training.
* [`linear_decoding.py`](https://github.com/eminorhan/baby-vision/blob/master/linear_decoding.py): evaluates self-supervised models on downstream linear classification tasks.
* [`linear_combination_maps.py`](https://github.com/eminorhan/baby-vision/blob/master/linear_combination_maps.py): plots spatial attention maps as in Figure 4b and Figure 6 in the paper.
* [`highly_activating_imgs.py`](https://github.com/eminorhan/baby-vision/blob/master/highly_activating_imgs.py): finds highly activating images for a given feature as in Figure 7b in the paper.
* [`selectivities.py`](https://github.com/eminorhan/baby-vision/blob/master/selectivities.py): measures the class selecitivity indices of all features in a given layer as in Figure 7a in the paper.
* [`hog_baseline.py`](https://github.com/eminorhan/baby-vision/blob/master/hog_baseline.py): runs the HOG baseline model as described in the paper.
* [`imagenet_finetuning.py`](https://github.com/eminorhan/baby-vision/blob/master/imagenet_finetuning.py): ImageNet evaluations.
* [`feature_animation.py`](https://github.com/eminorhan/baby-vision/blob/master/feature_animation.py) and [`feature_animation_class.py`](https://github.com/eminorhan/baby-vision/blob/master/feature_animation_class.py): Some tools for visualizing the learned features.

For specific usage examples, please see the slurm scripts provided in the [`scripts`](https://github.com/eminorhan/baby-vision/tree/master/scripts) directory.

## Pre-trained models

### ResNeXt 
Since the publication of the paper, we have found that training larger capacity models for longer with the temporal classification objective significantly improves the evaluation results. Hence, we provide below pre-trained `resnext50_32x4d` type models that are currently our best models trained with the SAYCam data. We encourage people to use these new models instead of the `mobilenet_v2` type models reported in the paper (the pre-trained `mobilenet_v2` models reported in the paper are also provided below for the record). 

Four pre-trained `resnext50_32x4d` models are provided here: temporal classification models trained on data from the individual children in the SAYCam dataset (`TC-S-resnext`, `TC-A-resnext`, `TC-Y-resnext`) and a temporal classification model trained on data from all three children (`TC-SAY-resnext`). These models were all trained for 16 epochs (with batch size 256) with the following data augmentation pipeline:

```python
import torchvision.transforms as tr

tr.Compose([
	tr.RandomResizedCrop(224, scale=(0.2, 1.)),
        tr.RandomApply([tr.ColorJitter(0.9, 0.9, 0.9, 0.5)], p=0.9),
        tr.RandomGrayscale(p=0.2),
        tr.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
        tr.RandomHorizontalFlip(),
        tr.ToTensor(),
        tr.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])
```

This data augmentation pipeline is similar to that used in [the SimCLR paper](https://arxiv.org/abs/2002.05709) with slightly larger random crops and slightly stronger color augmentation. Here are some evaluation results for these `resnext50_32x4d` models (to download the models, click on the links over the model names):

| Model | Toybox (*iid*) | Toybox (*exemplar*) | ImageNet (*linear*) | ImageNet (*1% ft + linear*) | 
| ----- |:--------------:|:-------------------:|:-------------------:|:---------------------------:|
| [`TC-SAY-resnext`](https://drive.google.com/file/d/1I-HvIeuupsE88yS6eff_nE6pHpEmTVPG/view?usp=sharing)  | **90.0** | **57.5** | **36.0** | **45.6** |
| [`TC-S-resnext`](https://drive.google.com/file/d/14tZeOtK1Jd64ioxPwzwf2jblriN7Jgue/view?usp=sharing)    | 88.5 | 54.9 | -- | -- |
| [`TC-A-resnext`](https://drive.google.com/file/d/1aQuWfb4O0xL0PALRJpYIUHk0tsyrujDF/view?usp=sharing)    | 86.8 | 50.4 | -- | -- |
| [`TC-Y-resnext`](https://drive.google.com/file/d/1sB12pdnVEZsgVKiVdZyS0l4x24_T5zCj/view?usp=sharing)    | 87.0 | 53.0 | -- | -- |

Here, **ImageNet (*linear*)** refers to the top-1 validation accuracy on ImageNet with only a linear classifier trained on top of the frozen features, and **ImageNet (*1% ft + linear*)** is similar but with the entire model first fine-tuned on 1% of the ImageNet training data (~12800 images). Note that these are results from a single run, so you may observe slightly different numbers.

These models come with the temporal classification heads attached. To load these models, please do something along the lines of:

```python
import torch
import torchvision.models as models

model = models.resnext50_32x4d(pretrained=False)
model.fc = torch.nn.Linear(in_features=2048, out_features=n_out, bias=True)
model = torch.nn.DataParallel(model).cuda()

checkpoint = torch.load('TC-SAY-resnext.tar')
model.load_state_dict(checkpoint['model_state_dict'])
```

where `n_out` should be 6269 for `TC-SAY-resnext`, 2765 for `TC-S-resnext`, 1786 for `TC-A-resnext`, and 1718 for `TC-Y-resnext`. The differences here are due to the different lengths of the datasets. 

In addition, please find below the best performing ImageNet models reported above: a model with a linear ImageNet classifier trained on top of the frozen features of `TC-SAY-resnext` (`TC-SAY-resnext-IN-linear`) and a model that was first fine-tuned with 1% of the ImageNet training data (`TC-SAY-resnext-IN-1pt-linear`):

* [`TC-SAY-resnext-IN-linear`](https://drive.google.com/file/d/1Qo0_1RwgOsr-JM3lP4ILWRY0WflnS7On/view?usp=sharing)
* [`TC-SAY-resnext-IN-1pt-linear`](https://drive.google.com/file/d/1lvCG3L1_-gdqWDMD41yTIbuNBpzpUOQq/view?usp=sharing)

You can load these models in the same way as described above. Since these are ImageNet models, `n_out` should be set to 1000.

### MobileNet 
The following are the pre-trained `mobilenet_v2` type models reported in the paper:

* [TC-S-mobilenet](https://drive.google.com/file/d/1DVJjpaGhoBPNmlO7jXpwEX3lSCk2ZUCa/view?usp=sharing) (69.4 MB)
* [TC-A-mobilenet](https://drive.google.com/file/d/1uQvJBbuy6P0uCW0HYs1wNgawRU8sGLhC/view?usp=sharing) (54.4 MB)
* [TC-Y-mobilenet](https://drive.google.com/file/d/1TTndiiiqSiCMdZjwYZPKQySZot4ipCrG/view?usp=sharing) (53.3 MB)
* [TC-SAY-mobilenet](https://drive.google.com/file/d/1zeidpBaXqqWCeeYj-fMI7V7x9EiAGH6Q/view?usp=sharing) (123.3 MB)

## Acknowledgments

We are very grateful to the volunteers who contributed recordings to the SAYCam dataset. We thank Jessica Sullivan for her generous assistance with the dataset. We also thank the team behind the Toybox dataset, as well as the developers of PyTorch and torchvision for making this work possible. This project was partly funded by the NSF Award 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science.


================================================
FILE: feature_animation.py
================================================
'''Animating features on short clips'''
import os
import argparse
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torchvision.utils import make_grid
import matplotlib as mp
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import matplotlib.cm as cm

# TODO: combine the map extraction functions into a single function 
# TODO: combine model loading functions into a single function 

def extract_map_layer_7x7_res(res_model):
    layer_list = list(res_model.module.children())[:-2]
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_7x7(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_14x14(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_layer_list = layer_list[:-5]
    new_layer_list.append(layer_list[-5].conv[0])
    new_model = torch.nn.Sequential(*new_layer_list)
    return new_model

def load_model_res(args):
    model = models.resnext50_32x4d(pretrained=False)
    model.fc = torch.nn.Linear(in_features=2048, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_model(args):
    model = models.mobilenet_v2(pretrained=True)
    model.classifier = torch.nn.Linear(in_features=1280, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_data(data_dir, args):

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        data_dir,
        transforms.Compose([transforms.Resize(224), transforms.ToTensor(), normalize])
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True, sampler=None
    )

    return train_loader

def predict(data_loader, model, batch_size, feature_idx):

    # switch to evaluate mode
    model.eval()

    preds_list = []
    imgs_list = []

    with torch.no_grad():
        for i, (images, target) in enumerate(data_loader):
            images = images.cuda()

            # compute predictions
            preds = model(images)
            preds = preds[:, feature_idx, :, :]
            preds_list.append(preds)
            imgs_list.append(images)

    preds = torch.cat(preds_list, 0)
    images = torch.cat(imgs_list, 0)

    print('Images shape:', images.size())
    print('Preds shape:', preds.size())

    # Copy activation map to all channels and upsample to image size
    x = torch.zeros(preds.size()[0], 3, 7, 7)
    x[:, 0, :, :] = preds
    x[:, 1, :, :] = preds
    x[:, 2, :, :] = preds

    m = torch.nn.Upsample(scale_factor=32, mode='bicubic')

    upsampled_maps = m(x).cuda()
    # upsampled_maps = torch.sigmoid(10. * upsampled_maps / torch.std(upsampled_maps))

    upsampled_maps = upsampled_maps.cpu().numpy()
    images = images.cpu().numpy()

    return  upsampled_maps, images

def show_img(ax, img, save_name):
    '''Save maps'''
    npimg = img.cpu().numpy()

    print(npimg.shape)

    ax.imshow(np.transpose(npimg, (1, 2, 0)), interpolation='nearest')

    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

    mp.rcParams['axes.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 1.15
    mp.rcParams['font.sans-serif'] = ['FreeSans']
    mp.rcParams['mathtext.fontset'] = 'cm'

    plt.savefig(save_name, bbox_inches='tight')


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Plot spatial attention maps')
    parser.add_argument('data', metavar='DIR', help='path to dataset')
    parser.add_argument('--workers', default=32, type=int, help='number of data loading workers (default: 4)')
    parser.add_argument('--batch-size', default=900, type=int, help='mini-batch size, this is the total '
                                                                    'batch size of all GPUs on the current node when '
                                                                    'using Data Parallel or Distributed Data Parallel')
    parser.add_argument('--model-path', default='', type=str, help='path to model checkpoint (default: '
                                                                   'ImageNet-pretrained)')
    parser.add_argument('--n_out', default=2765, type=int, help='output dim of pre-trained model')
    parser.add_argument('--feature-idx', default=1, type=int, help='feature index for which the maps will be computed')

    args = parser.parse_args()

    model = load_model(args)
    map_layer = extract_map_layer_7x7(model)

    data_loader = load_data(args.data, args)
    preds, images = predict(data_loader, map_layer, args.batch_size, args.feature_idx)

    preds = preds - preds.min()
    preds = preds / preds.max()
    preds = np.uint8(255 * preds)

    images = images - images.min()
    images = images / images.max()
    # images = np.uint8(255 * images)

    fig, ax = plt.subplots()
    ax.set_axis_off()
    ax.set_title('Feature: ' + str(args.feature_idx))

    jet = cm.get_cmap("jet")
    jet_colors = jet(np.arange(256))[:, :3]
    preds = jet_colors[preds[:, 0, :, :]]

    masked_imgs = 1.0 * preds + np.transpose(images, (0, 2, 3, 1))
    masked_imgs = np.uint8(255 * masked_imgs / masked_imgs.max())

    imgs = []
    for i in range(900):
        im = ax.imshow(masked_imgs[i])

        if i == 0:
            im = ax.imshow(masked_imgs[i])

        imgs.append([im])

    ani = animation.ArtistAnimation(fig, imgs, interval=200, blit=True, repeat_delay=1000)

    # To save the animation, use e.g.
    ani.save('intphys_feature_animation_' + str(args.feature_idx) + '.mp4')

================================================
FILE: feature_animation_class.py
================================================
'''Animating features on short clips'''
import os
import argparse
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torchvision.utils import make_grid
import matplotlib as mp
import matplotlib.pyplot as plt
import matplotlib.animation as animation
import matplotlib.cm as cm

# TODO: combine the map extraction functions into a single function 
# TODO: combine model loading functions into a single function 

def extract_map_layer_7x7_res(res_model):
    layer_list = list(res_model.module.children())[:-2]
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_7x7(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_14x14(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_layer_list = layer_list[:-5]
    new_layer_list.append(layer_list[-5].conv[0])
    new_model = torch.nn.Sequential(*new_layer_list)
    return new_model

def load_model_res(args):
    model = models.resnext50_32x4d(pretrained=False)
    model.fc = torch.nn.Linear(in_features=2048, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_model(args):
    model = models.mobilenet_v2(pretrained=True)
    model.classifier = torch.nn.Linear(in_features=1280, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_data(data_dir, args):

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        data_dir,
        transforms.Compose([transforms.ToTensor(), normalize])
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True, sampler=None
    )

    return train_loader

def predict(data_loader, model, batch_size, weights):

    # switch to evaluate mode
    model.eval()

    preds_list = []
    imgs_list = []

    with torch.no_grad():
        for i, (images, target) in enumerate(data_loader):
            images = images.cuda()

            # compute predictions
            preds = model(images)
            preds_list.append(preds)
            imgs_list.append(images)

    preds = torch.cat(preds_list, 0)
    images = torch.cat(imgs_list, 0)

    print('Images shape:', images.size())
    print('Preds shape:', preds.size())

    linear_combination_map = torch.einsum('ijkl,j->ikl', preds, weights)

    # Copy activation map to all channels and upsample to image size
    x = torch.zeros(preds.size()[0], 3, 7, 7)
    x[:, 0, :, :] = linear_combination_map
    x[:, 1, :, :] = linear_combination_map
    x[:, 2, :, :] = linear_combination_map

    m = torch.nn.Upsample(scale_factor=32, mode='bicubic')

    upsampled_maps = m(x).cuda()
    # upsampled_maps = torch.sigmoid(10. * upsampled_maps / torch.std(upsampled_maps))

    upsampled_maps = upsampled_maps.cpu().numpy()
    images = images.cpu().numpy()

    return  upsampled_maps, images

def show_img(ax, img, save_name):
    '''Save maps'''
    npimg = img.cpu().numpy()

    print(npimg.shape)

    ax.imshow(np.transpose(npimg, (1, 2, 0)), interpolation='nearest')

    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

    mp.rcParams['axes.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 1.15
    mp.rcParams['font.sans-serif'] = ['FreeSans']
    mp.rcParams['mathtext.fontset'] = 'cm'

    plt.savefig(save_name, bbox_inches='tight')


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Plot spatial attention maps')
    parser.add_argument('data', metavar='DIR', help='path to dataset')
    parser.add_argument('--workers', default=32, type=int, help='number of data loading workers (default: 4)')
    parser.add_argument('--batch-size', default=500, type=int, help='mini-batch size, this is the total '
                                                                    'batch size of all GPUs on the current node when '
                                                                    'using Data Parallel or Distributed Data Parallel')
    parser.add_argument('--model-path', default='', type=str, help='path to model checkpoint (default: '
                                                                   'ImageNet-pretrained)')
    parser.add_argument('--n_out', default=26, type=int, help='output dim of pre-trained model')
    parser.add_argument('--class-idx', default=1, type=int, help='class index for which the maps will be computed')

    args = parser.parse_args()

    model = load_model(args)
    map_layer = extract_map_layer_7x7(model)

    weights = model.module.classifier.weight.data[args.class_idx, :].cuda()

    data_loader = load_data(args.data, args)
    preds, images = predict(data_loader, map_layer, args.batch_size, weights)

    preds = preds - preds.min()
    preds = preds / preds.max()
    preds = np.uint8(255 * preds)

    images = images - images.min()
    images = images / images.max()
    # images = np.uint8(255 * images)

    fig, ax = plt.subplots()
    ax.set_axis_off()
    ax.set_title('Class: ' + str(args.class_idx))

    jet = cm.get_cmap("jet")
    jet_colors = jet(np.arange(256))[:, :3]
    preds = jet_colors[preds[:, 0, :, :]]

    masked_imgs = 1.0 * preds + np.transpose(images, (0, 2, 3, 1))
    masked_imgs = np.uint8(255 * masked_imgs / masked_imgs.max())

    imgs = []
    for i in range(200):
        im = ax.imshow(masked_imgs[i])

        if i == 0:
            im = ax.imshow(masked_imgs[i])

        imgs.append([im])

    ani = animation.ArtistAnimation(fig, imgs, interval=200, blit=True, repeat_delay=1000)

    # To save the animation, use e.g.
    ani.save('computers_feature_animation_' + str(args.class_idx) + '.mp4')

================================================
FILE: highly_activating_imgs.py
================================================
'''Plots highly activating images'''
import os
import argparse
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torchvision.utils import make_grid
import matplotlib as mp
import matplotlib.pyplot as plt


def extract_map_layer_7x7(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_14x14(mobilenetV2_model, layer):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_layer_list = layer_list[:-layer]
    new_layer_list.append(layer_list[-layer].conv[0])
    new_model = torch.nn.Sequential(*new_layer_list)
    return new_model

def load_model(args):
    model = models.mobilenet_v2(pretrained=True)
    model.classifier = torch.nn.Linear(in_features=1280, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_data(data_dir, args):

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        data_dir,
        transforms.Compose([transforms.ToTensor(), normalize])
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True, sampler=None
    )

    return train_loader

def predict(data_loader, model, neuron_idx):

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        for i, (images, target) in enumerate(data_loader):
            images = images.cuda()

            # compute predictions
            pred = model(images)
            pred_mean = torch.mean(pred, dim=(2, 3))
            pred_mean = pred_mean[:, neuron_idx]

            if i == 0:
                break

    _, indices = torch.sort(pred_mean, descending=True)
    images = images[indices, :, :, :]

    return images

def show_img(ax, img, save_name):
    '''Save maps'''
    npimg = img.cpu().numpy()

    print(npimg.shape)

    ax.imshow(np.transpose(npimg, (1, 2, 0)), interpolation='nearest')

    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

    mp.rcParams['axes.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 1.15
    mp.rcParams['font.sans-serif'] = ['FreeSans']
    mp.rcParams['mathtext.fontset'] = 'cm'

    plt.savefig(save_name, bbox_inches='tight')


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Plot highly activating images for a given feature')
    parser.add_argument('data', metavar='DIR', help='path to dataset')
    parser.add_argument('--workers', default=4, type=int, help='number of data loading workers (default: 4)')
    parser.add_argument('--batch-size', default=1024, type=int, help='mini-batch size, this is the total '
                                                                    'batch size of all GPUs on the current node when '
                                                                    'using Data Parallel or Distributed Data Parallel')
    parser.add_argument('--model-path', default='', type=str, help='path to latest checkpoint (default: none)')
    parser.add_argument('--n_out', default=1000, type=int, help='output dim')
    parser.add_argument('--neuron_idx', default=276, type=int, help='neuron index')

    args = parser.parse_args()

    model = load_model(args)
    map_layer = extract_map_layer_7x7(model)

    data_loader = load_data(args.data, args)
    imgs = predict(data_loader, map_layer, neuron_idx=args.neuron_idx)

    print('Imgs shape', imgs.shape)

    print('Plotting the top 10 images')
    fig_img = plt.figure(figsize=(16, 16), dpi=300)
    ax_img = fig_img.add_subplot('111')
    grid_img = make_grid(imgs[:10, :, :, :], nrow=10, padding=2, normalize=True, scale_each=False)
    show_img(ax_img, grid_img, 'highly_activating_imgs_neuron_' + str(args.neuron_idx) + '.pdf')

================================================
FILE: hog_baseline.py
================================================
'''HoG baseline'''
import os
import argparse
import numpy as np
from skimage.feature import hog
from skimage.io import imread
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser(description='Linear decoding with HoG model')
parser.add_argument('data', metavar='DIR', help='path to dataset')
parser.add_argument('--subsample', default=False, action='store_true', help='subsample data?')

if __name__ == '__main__':
    args = parser.parse_args()

    c_list = os.listdir(args.data)
    c_list.sort()
    print('Class list:', c_list)

    imgs = []
    labels = []
    label_counter = 0
    file_counter = 0

    for c in c_list:
            curr_dir = os.path.join(args.data, c)
            f_list = os.listdir(curr_dir)
            f_list.sort()

            print('Reading class:', c)

            for f in f_list:
                    f_path = os.path.join(curr_dir, f)
                    img = imread(f_path)
                    feats = hog(img, orientations=9, pixels_per_cell=(16, 16), cells_per_block=(3, 3), block_norm='L2',
                                visualize=False, transform_sqrt=False, feature_vector=True, multichannel=True)
                    if args.subsample:
                            if file_counter % 10 == 0:
                                    imgs.append(feats)
                                    labels.append(label_counter)
                    else:
                            imgs.append(feats)
                            labels.append(label_counter)

                    file_counter += 1

            label_counter += 1

    imgs = np.vstack(imgs)
    labels = np.array(labels)

    print('Imgs shape:', imgs.shape)
    print('Labels shape:', labels.shape)

    print('Splitting dataset')
    X_train, X_test, y_train, y_test = train_test_split(imgs, labels, test_size=0.5)

    print('Fitting training data')
    clf = SGDClassifier(loss="hinge", penalty="l2", alpha=0.0001, max_iter=250)
    clf.fit(X_train, y_train)

    print('Computing predictions')
    pred_test = clf.predict(X_test)
    test_acc = np.mean(y_test==pred_test)

    pred_train = clf.predict(X_train)
    train_acc = np.mean(y_train==pred_train)

    print('Test accuracy', test_acc)
    print('Train accuracy', train_acc)

================================================
FILE: imagenet_finetuning.py
================================================
import argparse
import os
import time
import warnings

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models


parser = argparse.ArgumentParser(description='ImageNet fine-tuning or linear classification')
parser.add_argument('-j', '--workers', default=32, type=int, metavar='N', help='number of data loading workers (default:'
                                                                              ' 4)')
parser.add_argument('--epochs', default=25, type=int, metavar='N', help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N', help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int, metavar='N',
                    help='mini-batch size (default: 256), this is the total batch size of all GPUs on the current node '
                         'when using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.0005, type=float, metavar='LR', help='initial learning rate',
                    dest='lr')
parser.add_argument('--wd', '--weight-decay', default=0.0, type=float, metavar='W', help='weight decay (default: 0)',
                    dest='weight_decay')
parser.add_argument('-p', '--print-freq', default=5000, type=int, metavar='N', help='print frequency (default: 100)')
parser.add_argument('--schedule', default=[23, 24], nargs='*', type=int,
                    help='learning rate schedule (when to drop lr by a ratio)')
parser.add_argument('--resume', default='', type=str, metavar='PATH', help='path to latest checkpoint (default: none)')
parser.add_argument('--world-size', default=-1, type=int, help='number of nodes for distributed training')
parser.add_argument('--rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str, help='url used to set up distributed '
                                                                                     'training')
parser.add_argument('--dist-backend', default='nccl', type=str, help='distributed backend')
parser.add_argument('--gpu', default=None, type=int, help='GPU id to use.')
parser.add_argument('--multiprocessing-distributed', action='store_true',
                    help='Use multi-processing distributed training to launch '
                         'N processes per node, which has N GPUs. This is the '
                         'fastest way to use PyTorch for either single node or '
                         'multi node data parallel training')
parser.add_argument('--n_out', default=20, type=int, help='output dim')
parser.add_argument('--freeze-trunk', default=False, action='store_true', help='freeze trunk?')
parser.add_argument('--frac-retained', default=1.0, type=float, help='fraction of tr data retained')


def set_parameter_requires_grad(model, feature_extracting=True):
    '''Helper function for setting body to non-trainable'''
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False

        for param in model.module.fc.parameters():
            print(param.shape)
            param.requires_grad = True

def main():
    args = parser.parse_args()

    if args.gpu is not None:
        warnings.warn('You have chosen a specific GPU. This will completely disable data parallelism.')

    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])

    args.distributed = args.world_size > 1 or args.multiprocessing_distributed

    ngpus_per_node = torch.cuda.device_count()
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)


def main_worker(gpu, ngpus_per_node, args):
    args.gpu = gpu

    if args.gpu is not None:
        print("Use GPU: {} for training".format(args.gpu))

    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            # For multiprocessing distributed training, rank needs to be the
            # global rank among all the processes
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)

    model = models.resnext50_32x4d(pretrained=False)
    model.fc = torch.nn.Linear(in_features=2048, out_features=args.n_out, bias=True)

    # DataParallel will divide and allocate batch_size to all available GPUs
    model = torch.nn.DataParallel(model).cuda()

    # if resume from a pretrained model
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading model '{}'".format(args.resume))
            checkpoint = torch.load(args.resume)
            model.load_state_dict(checkpoint['model_state_dict'])
            if args.freeze_trunk:
                print('Freezing trunk.')
                set_parameter_requires_grad(model)  # freeze the trunk
            model.module.fc = torch.nn.Linear(in_features=2048, out_features=1000, bias=True).cuda()
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))
    else:
        if args.freeze_trunk:
            print('Freezing trunk.')
            set_parameter_requires_grad(model)  # freeze the trunk
            model.module.fc = torch.nn.Linear(in_features=2048, out_features=1000, bias=True).cuda()

    print(model)

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(args.gpu)
    optimizer = torch.optim.Adam(model.parameters(), args.lr, weight_decay=args.weight_decay)
    cudnn.benchmark = True

    # Save file name
    if args.resume:
        sv_name = args.resume
        savefile_name = 'ft_IN_' + sv_name  # str(args.freeze_trunk) + 'fz_IN_' + sv_name[26:]
    else:
        savefile_name = str(args.freeze_trunk) + 'fz_IN_MobileNetV2_scratch.tar'

    # Data loaders
    basedir = '/misc/vlgscratch4/LakeGroup/emin/robust_vision/imagenet/'

    traindir = os.path.join(basedir, 'train')
    valdir = os.path.join(basedir, 'val')

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    train_dataset = datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(224),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize
        ])
    )

    val_dataset = datasets.ImageFolder(valdir, transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            normalize
        ]))

    if args.frac_retained < 1.0:
        print('Fraction of train data retained:', args.frac_retained)

        import numpy as np
        num_train = len(train_dataset)
        indices = list(range(num_train))
        np.random.shuffle(indices)
        train_idx = indices[:int(args.frac_retained * num_train)]
        train_sampler = torch.utils.data.sampler.SubsetRandomSampler(train_idx)

        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.batch_size, shuffle=False,
            num_workers=args.workers, pin_memory=True, sampler=train_sampler)
    else:
        print('Using all of train data')

        train_loader = torch.utils.data.DataLoader(
            train_dataset, batch_size=args.batch_size, shuffle=True,
            num_workers=args.workers, pin_memory=True, sampler=None)

    val_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True)

    acc1_list = []
    val_acc1_list = []

    for epoch in range(args.start_epoch, args.epochs):

        adjust_learning_rate(optimizer, epoch, args)

        # train for one epoch
        acc1 = train(train_loader, model, criterion, optimizer, epoch, args)
        acc1_list.append(acc1)

        # ... then validate
        val_acc1 = validate(val_loader, model, args)
        val_acc1_list.append(val_acc1)

    torch.save({'acc1_list': acc1_list,
                'val_acc1_list': val_acc1_list,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict()}, savefile_name)


def train(train_loader, model, criterion, optimizer, epoch, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(
        len(train_loader),
        [batch_time, data_time, losses, top1, top5],
        prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images = images.cuda(args.gpu, non_blocking=True)
        target = target.cuda(args.gpu, non_blocking=True)

        # compute output
        output = model(images)
        loss = criterion(output, target)

        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images.size(0))
        top1.update(acc1[0], images.size(0))
        top5.update(acc5[0], images.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            progress.display(i)

    return top1.avg.cpu().numpy()


def validate(val_loader, model, args):
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')

    # switch to eval mode
    model.eval()

    with torch.no_grad():
        for i, (images, target) in enumerate(val_loader):

            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)
            target = target.cuda(args.gpu, non_blocking=True)

            # compute output
            output = model(images)

            # measure accuracy and record loss
            acc1, acc5 = accuracy(output, target, topk=(1, 5))
            top1.update(acc1[0], images.size(0))
            top5.update(acc5[0], images.size(0))

    print('End of epoch validation: * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'.format(top1=top1, top5=top5))

    return top1.avg.cpu().numpy()


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    for milestone in args.schedule:
        lr *= 0.2 if epoch >= milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr

def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


if __name__ == '__main__':
    main()

================================================
FILE: linear_combination_maps.py
================================================
'''Plots spatial attention maps'''
import os
import argparse
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torchvision.utils import make_grid
import matplotlib as mp
import matplotlib.pyplot as plt

def extract_map_layer_7x7(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_14x14(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_layer_list = layer_list[:-5]
    new_layer_list.append(layer_list[-5].conv[0])
    new_model = torch.nn.Sequential(*new_layer_list)
    return new_model

def load_model(args):
    model = models.mobilenet_v2(pretrained=True)
    model.classifier = torch.nn.Linear(in_features=1280, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_data(data_dir, args):

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        data_dir,
        transforms.Compose([transforms.ToTensor(), normalize])
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True, sampler=None
    )

    return train_loader

def predict(data_loader, model, weights, batch_size):

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        for i, (images, target) in enumerate(data_loader):
            images = images.cuda()

            print(images.size())

            # compute predictions
            pred = model(images)

            if i == 0:
                break

    linear_combination_map = torch.einsum('ijkl,j->ikl', pred, weights)

    x = torch.zeros(batch_size, 3, 7, 7)
    x[:, 0, :, :] = linear_combination_map
    x[:, 1, :, :] = linear_combination_map
    x[:, 2, :, :] = linear_combination_map

    m = torch.nn.Upsample(scale_factor=32, mode='bicubic')
    mm = m(x).cuda()
    mm = torch.sigmoid(10. * mm / torch.std(mm))

    return mm * images

def show_img(ax, img, save_name):
    '''Save maps'''
    npimg = img.cpu().numpy()

    print(npimg.shape)

    ax.imshow(np.transpose(npimg, (1, 2, 0)), interpolation='nearest')

    ax.spines["bottom"].set_visible(False)
    ax.spines["left"].set_visible(False)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

    mp.rcParams['axes.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 0.75
    mp.rcParams['patch.linewidth'] = 1.15
    mp.rcParams['font.sans-serif'] = ['FreeSans']
    mp.rcParams['mathtext.fontset'] = 'cm'

    plt.savefig(save_name, bbox_inches='tight')


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Plot spatial attention maps')
    parser.add_argument('data', metavar='DIR', help='path to dataset')
    parser.add_argument('--workers', default=4, type=int, help='number of data loading workers (default: 4)')
    parser.add_argument('--batch-size', default=36, type=int, help='mini-batch size, this is the total '
                                                                    'batch size of all GPUs on the current node when '
                                                                    'using Data Parallel or Distributed Data Parallel')
    parser.add_argument('--model-path', default='', type=str, help='path to model checkpoint (default: '
                                                                   'ImageNet-pretrained)')
    parser.add_argument('--n_out', default=1000, type=int, help='output dim of pre-trained model')
    parser.add_argument('--class-idx', default=6, type=int, help='class index for which the maps will be computed')

    args = parser.parse_args()

    model = load_model(args)
    map_layer = extract_map_layer_7x7(model)

    weights = model.module.classifier.weight.data[args.class_idx, :].cuda()

    data_loader = load_data(args.data, args)
    preds = predict(data_loader, map_layer, weights, args.batch_size)

    print('Preds shape:', preds.shape)

    fig_pred = plt.figure(figsize=(16, 16), dpi=300)
    ax_pred = fig_pred.add_subplot('111')
    grid_pred = make_grid(preds, nrow=12, padding=1, normalize=True, scale_each=False)
    show_img(ax_pred, grid_pred, 'linear_combination_maps_class_' + str(args. class_idx) + '.pdf')

================================================
FILE: linear_decoding.py
================================================
import argparse
import os
import random
import shutil
import time
import warnings
import numpy as np

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models


parser = argparse.ArgumentParser(description='Linear decoding with headcam data')
parser.add_argument('data', metavar='DIR', help='path to dataset')
parser.add_argument('-j', '--workers', default=32, type=int, metavar='N', help='number of data loading workers (default: 32)')
parser.add_argument('--epochs', default=100, type=int, metavar='N', help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N', help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int, metavar='N',
                    help='mini-batch size (default: 1024), this is the total batch size of all GPUs on the current node '
                         'when using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.0005, type=float, metavar='LR', help='initial learning rate', dest='lr')
parser.add_argument('--wd', '--weight-decay', default=0.0, type=float, metavar='W', help='weight decay (default: 0)', dest='weight_decay')
parser.add_argument('-p', '--print-freq', default=100, type=int, metavar='N', help='print frequency (default: 100)')
parser.add_argument('--world-size', default=-1, type=int, help='number of nodes for distributed training')
parser.add_argument('--rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str, help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str, help='distributed backend')
parser.add_argument('--gpu', default=None, type=int, help='GPU id to use.')
parser.add_argument('--multiprocessing-distributed', action='store_true',
                    help='Use multi-processing distributed training to launch '
                         'N processes per node, which has N GPUs. This is the '
                         'fastest way to use PyTorch for either single node or '
                         'multi node data parallel training')
parser.add_argument('--model-name', type=str, default='random',
                    choices=['random', 'imagenet', 'TC-S', 'TC-A', 'TC-Y', 'TC-SAY', 'moco_img_0011', 'moco_temp_0011'],
                    help='evaluated model')
parser.add_argument('--num-outs', default=16127, type=int, help='number of outputs in pretrained model')
parser.add_argument('--num-classes', default=26, type=int, help='number of classes in downstream classification task')
parser.add_argument('--subsample', default=False, action='store_true', help='subsample data?')


def set_parameter_requires_grad(model, feature_extracting=True):
    '''Helper function for setting body to non-trainable'''
    if feature_extracting:
        for param in model.parameters():
            param.requires_grad = False


def load_split_train_test(datadir, args, train_frac=0.5):

    import numpy as np
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])

    train_data = datasets.ImageFolder(datadir, transform=transforms.Compose([transforms.ToTensor(), normalize]))
    test_data = datasets.ImageFolder(datadir, transform=transforms.Compose([transforms.ToTensor(), normalize]))

    num_train = len(train_data)

    print('Total data size is', num_train)

    indices = list(range(num_train))
    split = int(np.floor(train_frac * num_train))
    np.random.shuffle(indices)

    if args.subsample:
        num_data = int(0.1 * num_train)
        train_idx, test_idx = indices[:(num_data // 2)], indices[(num_data // 2):num_data]
    else:
        train_idx, test_idx = indices[:split], indices[split:]

    print('Training data size is', len(train_idx))
    print('Test data size is', len(test_idx))

    train_sampler = torch.utils.data.sampler.SubsetRandomSampler(train_idx)
    test_sampler = torch.utils.data.sampler.SubsetRandomSampler(test_idx)

    trainloader = torch.utils.data.DataLoader(train_data, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True, sampler=train_sampler)
    testloader = torch.utils.data.DataLoader(test_data, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True, sampler=test_sampler)

    return trainloader, testloader


def main():
    args = parser.parse_args()

    if args.gpu is not None:
        warnings.warn('You have chosen a specific GPU. This will completely disable data parallelism.')

    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])

    args.distributed = args.world_size > 1 or args.multiprocessing_distributed

    ngpus_per_node = torch.cuda.device_count()
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)


def main_worker(gpu, ngpus_per_node, args):
    args.gpu = gpu

    if args.gpu is not None:
        print("Use GPU: {} for training".format(args.gpu))

    # model definition
    num_classes = args.num_classes

    if args.model_name == 'random':
        model = models.mobilenet_v2(pretrained=False)
        set_parameter_requires_grad(model)
        model.classifier = torch.nn.Linear(in_features=1280, out_features=num_classes, bias=True)
        model = torch.nn.DataParallel(model).cuda()
    elif args.model_name == 'imagenet':
        model = models.mobilenet_v2(pretrained=True)
        set_parameter_requires_grad(model)
        model.classifier = torch.nn.Linear(in_features=1280, out_features=num_classes, bias=True)
        model = torch.nn.DataParallel(model).cuda()
    elif args.model_name.startswith('moco'):
        model = models.mobilenet_v2(pretrained=False)
        model.classifier = torch.nn.Linear(in_features=1280, out_features=args.num_outs, bias=True)
        checkpoint = torch.load('../self_supervised_models/' + args.model_name + '.pth.tar')

        # rename moco pre-trained keys
        state_dict = checkpoint['state_dict']
        for k in list(state_dict.keys()):
            # retain only encoder_q up to before the embedding layer
            if k.startswith('module.encoder_q') and not k.startswith('module.encoder_q.classifier'):
                # remove prefix
                state_dict[k[len("module.encoder_q."):]] = state_dict[k]
            # delete renamed or unused k
            del state_dict[k]

        msg = model.load_state_dict(state_dict, strict=False)
        assert set(msg.missing_keys) == {"classifier.weight", "classifier.bias"}

        print("=> loaded pre-trained model '{}'".format(args.model_name))

        set_parameter_requires_grad(model)  # freeze the trunk
        model.classifier = torch.nn.Linear(in_features=1280, out_features=num_classes, bias=True)
        model = torch.nn.DataParallel(model).cuda()
    else:
        model = models.resnext50_32x4d(pretrained=False)
        model.fc = torch.nn.Linear(in_features=2048, out_features=args.num_outs, bias=True)
        model = torch.nn.DataParallel(model).cuda()
        checkpoint = torch.load(args.model_name + '.tar')
        model.load_state_dict(checkpoint['model_state_dict'])
        set_parameter_requires_grad(model)  # freeze the trunk
        model.module.fc = torch.nn.Linear(in_features=2048, out_features=num_classes, bias=True).cuda()

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(args.gpu)
    optimizer = torch.optim.Adam(model.parameters(), args.lr, weight_decay=args.weight_decay)
    cudnn.benchmark = True

    # Data loading code
    savefile_name = args.model_name + '_labeledS_5_iid.tar'

    train_loader, test_loader = load_split_train_test(args.data, args)
    acc1_list = []
    val_acc1_list = []

    for epoch in range(args.start_epoch, args.epochs):

        # train for one epoch
        acc1 = train(train_loader, model, criterion, optimizer, epoch, args)
        acc1_list.append(acc1)

    # validate at end of epoch
    val_acc1, preds, target, images = validate(test_loader, model, args)
    val_acc1_list.append(val_acc1)

    torch.save({'acc1_list': acc1_list,
                'val_acc1_list': val_acc1_list,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'preds': preds,
                'target': target,
                'images': images
                }, savefile_name)


def train(train_loader, model, criterion, optimizer, epoch, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(
        len(train_loader),
        [batch_time, data_time, losses, top1, top5],
        prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images = images.cuda(args.gpu, non_blocking=True)
        target = target.cuda(args.gpu, non_blocking=True)

        # compute output
        output = model(images)
        loss = criterion(output, target)

        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 2))
        losses.update(loss.item(), images.size(0))
        top1.update(acc1[0], images.size(0))
        top5.update(acc5[0], images.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # for param in model.parameters():
        #     print(param.requires_grad)

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            progress.display(i)

    return top1.avg.cpu().numpy()


def validate(val_loader, model, args):
    batch_time = AverageMeter('Time', ':6.3f')
    top1 = AverageMeter('Acc@1', ':6.2f')

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        end = time.time()
        for i, (images, target) in enumerate(val_loader):
            if args.gpu is not None:
                images = images.cuda(args.gpu, non_blocking=True)
            target = target.cuda(args.gpu, non_blocking=True)

            # compute output
            output = model(images)

            preds = np.argmax(output.cpu().numpy(), axis=1)

            # measure accuracy and record loss
            acc1 = accuracy(output, target, topk=(1, ))
            top1.update(acc1[0].cpu().numpy()[0], images.size(0))

            # measure elapsed time
            batch_time.update(time.time() - end)
            end = time.time()

        print('* Acc@1 {top1.avg:.3f} '.format(top1=top1))

    return top1.avg, preds, target.cpu().numpy(), images.cpu().numpy()


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


if __name__ == '__main__':
    main()

================================================
FILE: moco/__init__.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved


================================================
FILE: moco/builder.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import torch
import torch.nn as nn


class MoCo(nn.Module):
    """
    Build a MoCo model with: a query encoder, a key encoder, and a queue
    https://arxiv.org/abs/1911.05722
    """
    def __init__(self, base_encoder, dim=128, K=65536, m=0.999, T=0.07, mlp=False):
        """
        dim: feature dimension (default: 128)
        K: queue size; number of negative keys (default: 65536)
        m: moco momentum of updating key encoder (default: 0.999)
        T: softmax temperature (default: 0.07)
        """
        super(MoCo, self).__init__()

        self.K = K
        self.m = m
        self.T = T

        # create the encoders
        # num_classes is the output fc dimension
        self.encoder_q = base_encoder(num_classes=dim)
        self.encoder_k = base_encoder(num_classes=dim)

        # self.encoder_q.classifier = self.encoder_q.classifier[-1]  # remove dropout (only for mobilenet_v2)
        # self.encoder_k.classifier = self.encoder_k.classifier[-1]  # remove dropout (only for mobilenet_v2)

        if mlp:  # hack: brute-force replacement
            dim_mlp = self.encoder_q.fc.weight.shape[1]
            self.encoder_q.fc = nn.Sequential(nn.Linear(dim_mlp, dim_mlp), nn.ReLU(), self.encoder_q.fc)
            self.encoder_k.fc = nn.Sequential(nn.Linear(dim_mlp, dim_mlp), nn.ReLU(), self.encoder_k.fc)

        for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            param_k.data.copy_(param_q.data)  # initialize
            param_k.requires_grad = False  # not update by gradient

        # create the queue
        self.register_buffer("queue", torch.randn(dim, K))
        self.queue = nn.functional.normalize(self.queue, dim=0)

        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update_key_encoder(self):
        """
        Momentum update of the key encoder
        """
        for param_q, param_k in zip(self.encoder_q.parameters(), self.encoder_k.parameters()):
            param_k.data = param_k.data * self.m + param_q.data * (1. - self.m)

    @torch.no_grad()
    def _dequeue_and_enqueue(self, keys):
        # gather keys before updating queue
        keys = concat_all_gather(keys)

        batch_size = keys.shape[0]

        ptr = int(self.queue_ptr)
        assert self.K % batch_size == 0  # for simplicity

        # replace the keys at ptr (dequeue and enqueue)
        self.queue[:, ptr:ptr + batch_size] = keys.T
        ptr = (ptr + batch_size) % self.K  # move pointer

        self.queue_ptr[0] = ptr

    @torch.no_grad()
    def _batch_shuffle_ddp(self, x):
        """
        Batch shuffle, for making use of BatchNorm.
        *** Only support DistributedDataParallel (DDP) model. ***
        """
        # gather from all gpus
        batch_size_this = x.shape[0]
        x_gather = concat_all_gather(x)
        batch_size_all = x_gather.shape[0]

        num_gpus = batch_size_all // batch_size_this

        # random shuffle index
        idx_shuffle = torch.randperm(batch_size_all).cuda()

        # broadcast to all gpus
        torch.distributed.broadcast(idx_shuffle, src=0)

        # index for restoring
        idx_unshuffle = torch.argsort(idx_shuffle)

        # shuffled index for this gpu
        gpu_idx = torch.distributed.get_rank()
        idx_this = idx_shuffle.view(num_gpus, -1)[gpu_idx]

        return x_gather[idx_this], idx_unshuffle

    @torch.no_grad()
    def _batch_unshuffle_ddp(self, x, idx_unshuffle):
        """
        Undo batch shuffle.
        *** Only support DistributedDataParallel (DDP) model. ***
        """
        # gather from all gpus
        batch_size_this = x.shape[0]
        x_gather = concat_all_gather(x)
        batch_size_all = x_gather.shape[0]

        num_gpus = batch_size_all // batch_size_this

        # restored index for this gpu
        gpu_idx = torch.distributed.get_rank()
        idx_this = idx_unshuffle.view(num_gpus, -1)[gpu_idx]

        return x_gather[idx_this]

    def forward(self, im_q, im_k):
        """
        Input:
            im_q: a batch of query images
            im_k: a batch of key images
        Output:
            logits, targets
        """

        # compute query features
        q = self.encoder_q(im_q)  # queries: NxC
        q = nn.functional.normalize(q, dim=1)

        # compute key features
        with torch.no_grad():  # no gradient to keys
            self._momentum_update_key_encoder()  # update the key encoder

            # shuffle for making use of BN
            im_k, idx_unshuffle = self._batch_shuffle_ddp(im_k)

            k = self.encoder_k(im_k)  # keys: NxC
            k = nn.functional.normalize(k, dim=1)

            # undo shuffle
            k = self._batch_unshuffle_ddp(k, idx_unshuffle)

        # compute logits
        # Einstein sum is more intuitive
        # positive logits: Nx1
        l_pos = torch.einsum('nc,nc->n', [q, k]).unsqueeze(-1)
        # negative logits: NxK
        l_neg = torch.einsum('nc,ck->nk', [q, self.queue.clone().detach()])

        # logits: Nx(1+K)
        logits = torch.cat([l_pos, l_neg], dim=1)

        # apply temperature
        logits /= self.T

        # labels: positive key indicators
        labels = torch.zeros(logits.shape[0], dtype=torch.long).cuda()

        # dequeue and enqueue
        self._dequeue_and_enqueue(k)

        return logits, labels


# utils
@torch.no_grad()
def concat_all_gather(tensor):
    """
    Performs all_gather operation on the provided tensors.
    *** Warning ***: torch.distributed.all_gather has no gradient.
    """
    tensors_gather = [torch.ones_like(tensor)
        for _ in range(torch.distributed.get_world_size())]
    torch.distributed.all_gather(tensors_gather, tensor, async_op=False)

    output = torch.cat(tensors_gather, dim=0)
    return output


================================================
FILE: moco/loader.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
from PIL import ImageFilter
import random


class TwoCropsTransform:
    """Take two random crops of one image as the query and key."""

    def __init__(self, base_transform):
        self.base_transform = base_transform

    def __call__(self, x):
        q = self.base_transform(x)
        k = self.base_transform(x)
        return [q, k]


class GaussianBlur(object):
    """Gaussian blur augmentation in SimCLR https://arxiv.org/abs/2002.05709"""

    def __init__(self, sigma=[.1, 2.]):
        self.sigma = sigma

    def __call__(self, x):
        sigma = random.uniform(self.sigma[0], self.sigma[1])
        x = x.filter(ImageFilter.GaussianBlur(radius=sigma))
        return x


================================================
FILE: moco_img.py
================================================
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import argparse
import builtins
import math
import os
import random
import shutil
import time
import warnings

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

import moco.loader
import moco.builder

model_names = sorted(name for name in models.__dict__
    if name.islower() and not name.startswith("__")
    and callable(models.__dict__[name]))

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('data', metavar='DIR',
                    help='path to dataset')
parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet50',
                    choices=model_names,
                    help='model architecture: ' +
                        ' | '.join(model_names) +
                        ' (default: resnet50)')
parser.add_argument('-j', '--workers', default=32, type=int, metavar='N',
                    help='number of data loading workers (default: 32)')
parser.add_argument('--epochs', default=12, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int,
                    metavar='N',
                    help='mini-batch size (default: 256), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.03, type=float,
                    metavar='LR', help='initial learning rate', dest='lr')
parser.add_argument('--schedule', default=[11, 20], nargs='*', type=int,
                    help='learning rate schedule (when to drop lr by 10x)')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                    help='momentum of SGD solver')
parser.add_argument('--wd', '--weight-decay', default=0, type=float,
                    metavar='W', help='weight decay (default: 0)',
                    dest='weight_decay')
parser.add_argument('-p', '--print-freq', default=1000, type=int,
                    metavar='N', help='print frequency (default: 10)')
parser.add_argument('--resume', default='', type=str, metavar='PATH',
                    help='path to latest checkpoint (default: none)')
parser.add_argument('--world-size', default=-1, type=int,
                    help='number of nodes for distributed training')
parser.add_argument('--rank', default=-1, type=int,
                    help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
                    help='distributed backend')
parser.add_argument('--seed', default=None, type=int,
                    help='seed for initializing training. ')
parser.add_argument('--gpu', default=None, type=int,
                    help='GPU id to use.')
parser.add_argument('--multiprocessing-distributed', action='store_true',
                    help='Use multi-processing distributed training to launch '
                         'N processes per node, which has N GPUs. This is the '
                         'fastest way to use PyTorch for either single node or '
                         'multi node data parallel training')

# moco specific configs:
parser.add_argument('--moco-dim', default=128, type=int,
                    help='feature dimension (default: 128)')
parser.add_argument('--moco-k', default=65536, type=int,
                    help='queue size; number of negative keys (default: 65536)')
parser.add_argument('--moco-m', default=0.999, type=float,
                    help='moco momentum of updating key encoder (default: 0.999)')
parser.add_argument('--moco-t', default=0.07, type=float,
                    help='softmax temperature (default: 0.07)')

# options for moco v2
parser.add_argument('--mlp', action='store_true',
                    help='use mlp head')
parser.add_argument('--aug-plus', action='store_true',
                    help='use moco v2 data augmentation')
parser.add_argument('--cos', action='store_true',
                    help='use cosine lr schedule')


def main():
    args = parser.parse_args()

    if args.seed is not None:
        random.seed(args.seed)
        torch.manual_seed(args.seed)
        cudnn.deterministic = True
        warnings.warn('You have chosen to seed training. '
                      'This will turn on the CUDNN deterministic setting, '
                      'which can slow down your training considerably! '
                      'You may see unexpected behavior when restarting '
                      'from checkpoints.')

    if args.gpu is not None:
        warnings.warn('You have chosen a specific GPU. This will completely '
                      'disable data parallelism.')

    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])

    args.distributed = args.world_size > 1 or args.multiprocessing_distributed

    ngpus_per_node = torch.cuda.device_count()
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size
        # needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the
        # main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)


def main_worker(gpu, ngpus_per_node, args):

    args.gpu = gpu

    print(args)

    # suppress printing if not master
    if args.multiprocessing_distributed and args.gpu != 0:
        def print_pass(*args):
            pass
        builtins.print = print_pass

    if args.gpu is not None:
        print("Use GPU: {} for training".format(args.gpu))

    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            # For multiprocessing distributed training, rank needs to be the
            # global rank among all the processes
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    # create model
    print("=> creating model '{}'".format(args.arch))
    model = moco.builder.MoCo(
        models.__dict__[args.arch],
        args.moco_dim, args.moco_k, args.moco_m, args.moco_t, args.mlp)
    print(model)

    if args.distributed:
        # For multiprocessing distributed, DistributedDataParallel constructor
        # should always set the single device scope, otherwise,
        # DistributedDataParallel will use all available devices.
        if args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model.cuda(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs we have
            args.batch_size = int(args.batch_size / ngpus_per_node)
            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.cuda()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
    elif args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model = model.cuda(args.gpu)
        # comment out the following line for debugging
        raise NotImplementedError("Only DistributedDataParallel is supported.")
    else:
        # AllGather implementation (batch shuffle, queue update, etc.) in
        # this code only supports DistributedDataParallel.
        raise NotImplementedError("Only DistributedDataParallel is supported.")

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(args.gpu)

    optimizer = torch.optim.SGD(model.parameters(), args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)

    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}'".format(args.resume))
            if args.gpu is None:
                checkpoint = torch.load(args.resume)
            else:
                # Map model to be loaded to specified single gpu.
                loc = 'cuda:{}'.format(args.gpu)
                checkpoint = torch.load(args.resume, map_location=loc)
            args.start_epoch = checkpoint['epoch']
            model.load_state_dict(checkpoint['state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            print("=> loaded checkpoint '{}' (epoch {})"
                  .format(args.resume, checkpoint['epoch']))
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    cudnn.benchmark = True

    # Data loading code
    traindir = args.data
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    if args.aug_plus:
        # MoCo v2's aug: similar to SimCLR https://arxiv.org/abs/2002.05709
        augmentation = [
            transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
            transforms.RandomApply([
                transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)  # not strengthened
            ], p=0.8),
            transforms.RandomGrayscale(p=0.2),
            transforms.RandomApply([moco.loader.GaussianBlur([.1, 2.])], p=0.5),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize
        ]
    else:
        # MoCo v1's aug: the same as InstDisc https://arxiv.org/abs/1805.01978
        augmentation = [
            transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
            transforms.RandomGrayscale(p=0.2),
            transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize
        ]

    train_dataset = datasets.ImageFolder(
        traindir,
        moco.loader.TwoCropsTransform(transforms.Compose(augmentation)))

    print('Dataset size:', len(train_dataset))

    if args.distributed:
        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    else:
        train_sampler = None

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),
        num_workers=args.workers, pin_memory=True, sampler=train_sampler, drop_last=True)

    print('Starting training ...')
    for epoch in range(args.start_epoch, args.epochs):
        if args.distributed:
            train_sampler.set_epoch(epoch)
        adjust_learning_rate(optimizer, epoch, args)

        print('Start of epoch ', epoch)

        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch, args)

        if not args.multiprocessing_distributed or (args.multiprocessing_distributed
                and args.rank % ngpus_per_node == 0):
            save_checkpoint({
                'epoch': epoch + 1,
                'arch': args.arch,
                'state_dict': model.state_dict(),
                'optimizer' : optimizer.state_dict(),
            }, is_best=False, filename='moco_img_checkpoint_{:04d}.pth.tar'.format(epoch))


def train(train_loader, model, criterion, optimizer, epoch, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(
        len(train_loader),
        [batch_time, data_time, losses, top1, top5],
        prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, _) in enumerate(train_loader):

        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images[0] = images[0].cuda(args.gpu, non_blocking=True)
            images[1] = images[1].cuda(args.gpu, non_blocking=True)

        # compute output
        output, target = model(im_q=images[0], im_k=images[1])
        loss = criterion(output, target)

        # acc1/acc5 are (K+1)-way contrast classifier accuracy
        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images[0].size(0))
        top1.update(acc1[0], images[0].size(0))
        top5.update(acc5[0], images[0].size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            progress.display(i)


def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    if args.cos:  # cosine lr schedule
        lr *= 0.5 * (1. + math.cos(math.pi * epoch / args.epochs))
    else:  # stepwise lr schedule
        for milestone in args.schedule:
            lr *= 0.1 if epoch >= milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


if __name__ == '__main__':
    main()


================================================
FILE: moco_temp.py
================================================
#!/usr/bin/env python
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
import argparse
import builtins
import math
import os
import random
import shutil
import time
import warnings

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models

import moco.loader
import moco.builder
from moco_utils import DistributedProxySampler, ContrastiveBatchSampler

model_names = sorted(name for name in models.__dict__
    if name.islower() and not name.startswith("__")
    and callable(models.__dict__[name]))

parser = argparse.ArgumentParser(description='PyTorch ImageNet Training')
parser.add_argument('data', metavar='DIR',
                    help='path to dataset')
parser.add_argument('-a', '--arch', metavar='ARCH', default='resnet50',
                    choices=model_names,
                    help='model architecture: ' +
                        ' | '.join(model_names) +
                        ' (default: resnet50)')
parser.add_argument('-j', '--workers', default=32, type=int, metavar='N',
                    help='number of data loading workers (default: 32)')
parser.add_argument('--epochs', default=12, type=int, metavar='N',
                    help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N',
                    help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int,
                    metavar='N',
                    help='mini-batch size (default: 256), this is the total '
                         'batch size of all GPUs on the current node when '
                         'using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.03, type=float,
                    metavar='LR', help='initial learning rate', dest='lr')
parser.add_argument('--schedule', default=[11, 20], nargs='*', type=int,
                    help='learning rate schedule (when to drop lr by 10x)')
parser.add_argument('--momentum', default=0.9, type=float, metavar='M',
                    help='momentum of SGD solver')
parser.add_argument('--wd', '--weight-decay', default=0, type=float,
                    metavar='W', help='weight decay (default: 0)',
                    dest='weight_decay')
parser.add_argument('-p', '--print-freq', default=1000, type=int,
                    metavar='N', help='print frequency (default: 10)')
parser.add_argument('--resume', default='', type=str, metavar='PATH',
                    help='path to latest checkpoint (default: none)')
parser.add_argument('--world-size', default=-1, type=int,
                    help='number of nodes for distributed training')
parser.add_argument('--rank', default=-1, type=int,
                    help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str,
                    help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='nccl', type=str,
                    help='distributed backend')
parser.add_argument('--seed', default=None, type=int,
                    help='seed for initializing training. ')
parser.add_argument('--gpu', default=None, type=int,
                    help='GPU id to use.')
parser.add_argument('--multiprocessing-distributed', action='store_true',
                    help='Use multi-processing distributed training to launch '
                         'N processes per node, which has N GPUs. This is the '
                         'fastest way to use PyTorch for either single node or '
                         'multi node data parallel training')

# moco specific configs:
parser.add_argument('--moco-dim', default=128, type=int,
                    help='feature dimension (default: 128)')
parser.add_argument('--moco-k', default=65536, type=int,
                    help='queue size; number of negative keys (default: 65536)')
parser.add_argument('--moco-m', default=0.999, type=float,
                    help='moco momentum of updating key encoder (default: 0.999)')
parser.add_argument('--moco-t', default=0.07, type=float,
                    help='softmax temperature (default: 0.07)')

# options for moco v2
parser.add_argument('--mlp', action='store_true',
                    help='use mlp head')
parser.add_argument('--aug-plus', action='store_true',
                    help='use moco v2 data augmentation')
parser.add_argument('--cos', action='store_true',
                    help='use cosine lr schedule')


def main():
    args = parser.parse_args()

    if args.seed is not None:
        random.seed(args.seed)
        torch.manual_seed(args.seed)
        cudnn.deterministic = True
        warnings.warn('You have chosen to seed training. '
                      'This will turn on the CUDNN deterministic setting, '
                      'which can slow down your training considerably! '
                      'You may see unexpected behavior when restarting '
                      'from checkpoints.')

    if args.gpu is not None:
        warnings.warn('You have chosen a specific GPU. This will completely '
                      'disable data parallelism.')

    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])

    args.distributed = args.world_size > 1 or args.multiprocessing_distributed

    ngpus_per_node = torch.cuda.device_count()
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size
        # needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the
        # main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)


def main_worker(gpu, ngpus_per_node, args):

    args.gpu = gpu

    # suppress printing if not master
    if args.multiprocessing_distributed and args.gpu != 0:
        def print_pass(*args):
            pass
        builtins.print = print_pass

    if args.gpu is not None:
        print("Use GPU: {} for training".format(args.gpu))

    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            # For multiprocessing distributed training, rank needs to be the
            # global rank among all the processes
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)
    # create model
    print("=> creating model '{}'".format(args.arch))
    model = moco.builder.MoCo(
        models.__dict__[args.arch],
        args.moco_dim, args.moco_k, args.moco_m, args.moco_t, args.mlp)
    print(model)

    if args.distributed:
        # For multiprocessing distributed, DistributedDataParallel constructor
        # should always set the single device scope, otherwise,
        # DistributedDataParallel will use all available devices.
        if args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model.cuda(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs we have
            args.batch_size = int(args.batch_size / ngpus_per_node)
            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.cuda()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
    elif args.gpu is not None:
        torch.cuda.set_device(args.gpu)
        model = model.cuda(args.gpu)
        # comment out the following line for debugging
        raise NotImplementedError("Only DistributedDataParallel is supported.")
    else:
        # AllGather implementation (batch shuffle, queue update, etc.) in
        # this code only supports DistributedDataParallel.
        raise NotImplementedError("Only DistributedDataParallel is supported.")

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(args.gpu)

    optimizer = torch.optim.SGD(model.parameters(), args.lr,
                                momentum=args.momentum,
                                weight_decay=args.weight_decay)

    # optionally resume from a checkpoint
    if args.resume:
        if os.path.isfile(args.resume):
            print("=> loading checkpoint '{}'".format(args.resume))
            if args.gpu is None:
                checkpoint = torch.load(args.resume)
            else:
                # Map model to be loaded to specified single gpu.
                loc = 'cuda:{}'.format(args.gpu)
                checkpoint = torch.load(args.resume, map_location=loc)
            args.start_epoch = checkpoint['epoch']
            model.load_state_dict(checkpoint['state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer'])
            print("=> loaded checkpoint '{}' (epoch {})"
                  .format(args.resume, checkpoint['epoch']))
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    cudnn.benchmark = True

    # Data loading code
    traindir = args.data
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    if args.aug_plus:
        # MoCo v2's aug: similar to SimCLR https://arxiv.org/abs/2002.05709
        augmentation = [
            transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
            transforms.RandomApply([
                transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)  # not strengthened
            ], p=0.8),
            transforms.RandomGrayscale(p=0.2),
            transforms.RandomApply([moco.loader.GaussianBlur([.1, 2.])], p=0.5),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize
        ]
    else:
        # MoCo v1's aug: the same as InstDisc https://arxiv.org/abs/1805.01978
        augmentation = [
            transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
            transforms.RandomGrayscale(p=0.2),
            transforms.ColorJitter(0.4, 0.4, 0.4, 0.4),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize
        ]

    train_dataset = datasets.ImageFolder(
        traindir, transforms.Compose(augmentation))

    print('Dataset size:', len(train_dataset))

    if args.distributed:
        train_sampler = DistributedProxySampler(ContrastiveBatchSampler(
            train_dataset, args.batch_size, 1, False))
    else:
        train_sampler = None

    train_loader = torch.utils.data.DataLoader(train_dataset, shuffle=(train_sampler is None),
                                               num_workers=args.workers, pin_memory=True,
                                               batch_sampler=train_sampler)

    print('Starting training ...')
    for epoch in range(args.start_epoch, args.epochs):
        if args.distributed:
            train_sampler.set_epoch(epoch)
        adjust_learning_rate(optimizer, epoch, args)

        print('Start of epoch ', epoch)

        # train for one epoch
        train(train_loader, model, criterion, optimizer, epoch, args)

        if not args.multiprocessing_distributed or (args.multiprocessing_distributed
                and args.rank % ngpus_per_node == 0):
            save_checkpoint({
                'epoch': epoch + 1,
                'arch': args.arch,
                'state_dict': model.state_dict(),
                'optimizer' : optimizer.state_dict(),
            }, is_best=False, filename='moco_temp_checkpoint_{:04d}.pth.tar'.format(epoch))


def train(train_loader, model, criterion, optimizer, epoch, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(
        len(train_loader),
        [batch_time, data_time, losses, top1, top5],
        prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, _) in enumerate(train_loader):

        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images_0 = images[:images.size(0)//2].cuda(args.gpu, non_blocking=True)
            images_1 = images[images.size(0)//2:].cuda(args.gpu, non_blocking=True)

        # compute output
        output, target = model(im_q=images_0, im_k=images_1)
        loss = criterion(output, target)

        # acc1/acc5 are (K+1)-way contrast classifier accuracy
        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images_0.size(0))
        top1.update(acc1[0], images_0.size(0))
        top5.update(acc5[0], images_0.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            progress.display(i)


def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def adjust_learning_rate(optimizer, epoch, args):
    """Decay the learning rate based on schedule"""
    lr = args.lr
    if args.cos:  # cosine lr schedule
        lr *= 0.5 * (1. + math.cos(math.pi * epoch / args.epochs))
    else:  # stepwise lr schedule
        for milestone in args.schedule:
            lr *= 0.1 if epoch >= milestone else 1.
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


if __name__ == '__main__':
    main()


================================================
FILE: moco_utils.py
================================================
# Defines some util functions
import torch
from torch.utils.data import Sampler
from torch.utils.data.distributed import DistributedSampler


class DistributedProxySampler(DistributedSampler):
    """Sampler that restricts data loading to a subset of input sampler indices.

    It is especially useful in conjunction with
    :class:`torch.nn.parallel.DistributedDataParallel`. In such case, each
    process can pass a DistributedSampler instance as a DataLoader sampler,
    and load a subset of the original dataset that is exclusive to it.

    .. note::
        Input sampler is assumed to be of constant size.

    Arguments:
        sampler: Input data sampler.
        num_replicas (optional): Number of processes participating in
            distributed training.
        rank (optional): Rank of the current process within num_replicas.
    """

    def __init__(self, sampler, num_replicas=None, rank=None):
        super(DistributedProxySampler, self).__init__(sampler, num_replicas=num_replicas, rank=rank, shuffle=False)
        self.sampler = sampler

    def __iter__(self):
        # deterministically shuffle based on epoch
        torch.manual_seed(self.epoch)
        indices = list(self.sampler)

        # add extra samples to make it evenly divisible
        indices += indices[:(self.total_size - len(indices))]
        if len(indices) != self.total_size:
            raise RuntimeError("{} vs {}".format(len(indices), self.total_size))

        # subsample
        indices = indices[self.rank:self.total_size:self.num_replicas]
        if len(indices) != self.num_samples:
            raise RuntimeError("{} vs {}".format(len(indices), self.num_samples))

        return iter(indices)

    def set_epoch(self, epoch):
        self.epoch = epoch

class ContrastiveBatchSampler(Sampler):
    def __init__(self, data_source, batch_size, pos_window, drop_last):
        self.data_source = data_source
        self.batch_size = batch_size
        self.pos_window = pos_window
        self.drop_last = drop_last
        self.n = len(self.data_source)

    def __iter__(self):
        for i in range(self.n // self.batch_size):
            x = torch.randint(low=0, high=self.n-1, size=(self.batch_size//2,),
                              dtype=torch.int64)
            y = x + torch.randint(low=-self.pos_window, high=self.pos_window, size=(self.batch_size//2,),
                                  dtype=torch.int64)
            y = torch.clamp(y, 0, self.n-1)
            z = x.tolist() + y.tolist()

            yield z

    def __len__(self):
        if self.drop_last:
            return self.n // self.batch_size
        else:
            return (self.n + self.batch_size - 1) // self.batch_size

================================================
FILE: read_saycam.py
================================================
import os
import sys
import argparse
import cv2
import numpy as np


parser = argparse.ArgumentParser(description='Read SAYCam videos')
parser.add_argument('data', metavar='DIR', help='path to SAYCam videos')
parser.add_argument('--save-dir', default='', type=str, help='save directory')
parser.add_argument('--fps', default=5, type=int, help='sampling rate (frames per second)')
parser.add_argument('--seg-len', default=288, type=int, help='segment length (seconds)')


if __name__ == '__main__':

    args = parser.parse_args()

    file_list = os.listdir(args.data)
    file_list.sort()

    class_counter = 0
    img_counter = 0
    file_counter = 0

    final_size = 224
    resized_minor_length = 256
    edge_filter = False
    n_imgs_per_class = args.seg_len * args.fps

    curr_dir_name = os.path.join(args.save_dir, 'class_{:04d}'.format(class_counter))
    os.mkdir(curr_dir_name)

    for file_indx in file_list:
        file_name = os.path.join(args.data, file_indx)

        cap = cv2.VideoCapture(file_name)
        frame_count = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
        frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        frame_rate = int(cap.get(cv2.CAP_PROP_FPS))

        # take every sample_rate frames (30: 1fps, 15: 2fps, 10: 3fps, 6: 5fps, 5: 6fps, 3: 10fps, 2: 15fps, 1: 30fps)
        sample_rate = frame_rate // args.fps + 1

        print('Total frame count: ', frame_count)
        print('Native frame rate: ', frame_rate)

        fc = 0
        ret = True

        # Resize
        new_height = frame_height * resized_minor_length // min(frame_height, frame_width)
        new_width = frame_width * resized_minor_length // min(frame_height, frame_width)

        while (fc < frame_count):

            ret, frame = cap.read()

            if fc % sample_rate == 0 and ret:

                # Resize
                resized_frame = cv2.resize(frame, (new_width, new_height), interpolation=cv2.INTER_CUBIC)

                # Crop
                height, width, _ = resized_frame.shape
                startx = width // 2 - (final_size // 2)
                starty = height // 2 - (final_size // 2) - 16
                cropped_frame = resized_frame[starty:starty + final_size, startx:startx + final_size]
                assert cropped_frame.shape[0] == final_size and cropped_frame.shape[1] == final_size, \
                    (cropped_frame.shape, height, width)

                if edge_filter:
                    cropped_frame = cv2.Laplacian(cropped_frame, cv2.CV_64F, ksize=5)
                    img_min = cropped_frame.min()
                    img_max = cropped_frame.max()
                    cropped_frame = np.uint8(255 * (cropped_frame - img_min) / (img_max - img_min))

                cv2.imwrite(os.path.join(curr_dir_name, 'img_{:04d}.jpeg'.format(img_counter)), cropped_frame[::-1, ::-1, :])
                img_counter += 1

                if img_counter == n_imgs_per_class:
                    img_counter = 0
                    class_counter += 1
                    curr_dir_name = os.path.join(args.save_dir, 'class_{:04d}'.format(class_counter))
                    os.mkdir(curr_dir_name)

            fc += 1

        cap.release()

        file_counter += 1
        print('Completed video {:4d} of {:4d}'.format(file_counter, len(file_list)))

================================================
FILE: scripts/feature_animation.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --gres=gpu:1080ti:2
#SBATCH --mem=150GB
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --job-name=feature_animation
#SBATCH --output=feature_animation_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/feature_animation.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/feature_animation_imgs_intphys/' --model-path '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/self_supervised_models/TC-SAY.tar' --batch-size 900 --n_out 6269 --feature-idx 600

echo "Done"


================================================
FILE: scripts/feature_animation_class.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --exclude=hpc1,hpc2,hpc3,hpc4,hpc5,hpc6,hpc7,hpc8,hpc9,vine3,vine4,vine6,vine11,vine12,lion17,rose7,rose8,rose9
#SBATCH --ntasks=1
#SBATCH --gres=gpu:4
#SBATCH --mem=100GB
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --job-name=feature_animation
#SBATCH --output=feature_animation_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/feature_animation_class.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/feature_animation_computers/' --model-path '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/self_supervised_models/TC-S_labeledS_5_iid.tar' --batch-size 200 --n_out 26 --class-idx 6

echo "Done"


================================================
FILE: scripts/highly_activating_imgs.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --job-name=activating_imgs
#SBATCH --output=activating_imgs_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/highly_activating_imgs.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --n_out 26 --model-path 'mobilenetV2_S_5fps_2000cls_coloraug_labeled.tar' 

echo "Done"


================================================
FILE: scripts/hog_baseline.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=100GB
#SBATCH --time=6:00:00
#SBATCH --array=0
#SBATCH --job-name=hog
#SBATCH --output=hog_%A_%a.out

module purge

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/hog_baseline.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/toybox_1fps/'

echo "Done"


================================================
FILE: scripts/imagenet_finetuning.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --gres=gpu:titanrtx:4
#SBATCH --mem=150GB
#SBATCH --time=48:00:00
#SBATCH --array=0
#SBATCH --job-name=finetune_imgnet
#SBATCH --output=finetune_imgnet_%A_%a.out

module purge
module load cuda-10.1

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/imagenet_finetuning.py --freeze-trunk --n_out 6269 --resume 'resnext50_32x4d_augmentstrong_batch256_True_SAY_5_288_epoch_15.tar'
#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/imagenet_finetuning.py --freeze-trunk --n_out 2765 --resume 'resnext50_32x4d_augmentstrong_batch256_True_S_5_288_epoch_10.tar'
#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/imagenet_finetuning.py --freeze-trunk --n_out 1786 --resume 'resnext50_32x4d_augmentstrong_batch256_True_A_5_288_epoch_10.tar'
#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/imagenet_finetuning.py --freeze-trunk --n_out 1718 --resume 'resnext50_32x4d_augmentstrong_batch256_True_Y_5_288_epoch_10.tar'

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/imagenet_finetuning.py --frac-retained 0.01 --n_out 6269 --resume 'resnext50_32x4d_augmentstrong_batch256_True_SAY_5_288_epoch_15.tar'
python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/imagenet_finetuning.py --freeze-trunk --n_out 1000 --resume 'ft_IN_resnext50_32x4d_augmentstrong_batch256_True_SAY_5_288_epoch_15.tar'

echo "Done"


================================================
FILE: scripts/linear_combination_maps.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --job-name=linear_maps
#SBATCH --output=linear_maps_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_combination_maps.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --n_out 26 --model-path 'mobilenetV2_S_5fps_2000cls_coloraug_labeled.tar' 

echo "Done"


================================================
FILE: scripts/linear_decoding.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=8
#SBATCH --gres=gpu:titanrtx:2
#SBATCH --mem=150GB
#SBATCH --time=12:00:00
#SBATCH --array=0
#SBATCH --job-name=linear_decoding
#SBATCH --output=linear_decoding_%A_%a.out

module purge
module load cuda-10.1

#python -u /misc/vlgscratch4/LakeGroup/emin/baby_vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --model-name 'random' --num-classes 26 --subsample

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --model-name 'moco_img_0005' --num-classes 26 --subsample

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --model-name 'moco_temp_0005' --num-classes 26 --subsample 

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_5/' --model-name 'TC-S' --num-outs 2765

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --model-name 'mobilenetV2_A_5fps_2000cls_coloraug' --num-outs 1786

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/' --model-name 'mobilenetV2_Y_5fps_2000cls_coloraug' --num-outs 1718

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_5/' --model-name 'TC-SAY' --num-outs 6269

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/linear_decoding.py '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_5/' --model-name 'TC-S' --num-outs 2765

echo "Done"


================================================
FILE: scripts/moco_img.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --gres=gpu:titanrtx:4
#SBATCH --mem=150GB
#SBATCH --time=48:00:00
#SBATCH --array=0
#SBATCH --job-name=moco_img
#SBATCH --output=moco_img_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/moco_img.py \
  -a resnext50_32x4d \
  --lr 0.015 \
  --batch-size 256 \
  --mlp \
  --moco-t 0.2 \
  --aug-plus --cos \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed \
  --world-size 1 --rank 0 \
  --start-epoch 0 \
  --resume '' \
  '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_data_5fps_2000cls_pytorch/'

echo "Done"


================================================
FILE: scripts/moco_temp.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --gres=gpu:v100:4
#SBATCH --mem=150GB
#SBATCH --time=48:00:00
#SBATCH --array=0
#SBATCH --job-name=moco_temp
#SBATCH --output=moco_temp_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/moco_temp.py \
  -a resnext50_32x4d \
  --lr 0.015 \
  --batch-size 256 \
  --mlp \
  --moco-t 0.2 \
  --aug-plus --cos \
  --dist-url 'tcp://localhost:10001' \
  --multiprocessing-distributed \
  --world-size 1 --rank 0 \
  --start-epoch 0 \
  --resume '' \
  '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_data_5fps_2000cls_pytorch/'

echo "Done"


================================================
FILE: scripts/read_saycam.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=16GB
#SBATCH --time=48:00:00
#SBATCH --array=0
#SBATCH --job-name=read_saycam
#SBATCH --output=read_saycam_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/read_saycam.py \
  --save-dir '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_15fps_288s' \
  --fps 15 \
  --seg-len 288 \
  '/misc/vlgscratch4/LakeGroup/emin/headcam/data_2/S'

echo "Done"


================================================
FILE: scripts/selectivities.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --gres=gpu:2
#SBATCH --mem=64GB
#SBATCH --time=1:00:00
#SBATCH --array=0
#SBATCH --job-name=selectivity
#SBATCH --output=selectivity_%A_%a.out

module purge
module load cuda-10.1

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/selectivities.py \
--n_out 1000 \
--model-path '' \
--layer 18 \
'/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_clean_labeled_data_1fps_4/'

echo "Done"


================================================
FILE: scripts/temporal_classification.sh
================================================
#!/bin/bash

#SBATCH --nodes=1
#SBATCH --ntasks=16
#SBATCH --gres=gpu:v100:4
#SBATCH --mem=150GB
#SBATCH --time=48:00:00
#SBATCH --array=0
#SBATCH --job-name=tempclas
#SBATCH --output=tempclas_%A_%a.out

module purge
module load cuda-10.1

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/temporal_classification.py --model 'resnext50_32x4d' --n_out 6269 --resume 'resnext50_32x4d_augmentstrong_batch256_True_SAY_5_288_epoch_15.tar' --start-epoch 16 '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/SAY_data_5fps_2000cls_pytorch/'

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/temporal_classification.py --model 'resnext50_32x4d' --n_out 2765 '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/S_data_5fps_2000cls_pytorch/'

#python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/temporal_classification.py --model 'resnext50_32x4d' --n_out 1786 '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/A_data_5fps_2000cls_pytorch/'

python -u /misc/vlgscratch4/LakeGroup/emin/baby-vision/temporal_classification.py --model 'resnext50_32x4d' --n_out 1718 '/misc/vlgscratch4/LakeGroup/emin/headcam/preprocessing/Y_data_5fps_2000cls_pytorch/'

echo "Done"


================================================
FILE: selectivities.py
================================================
'''Measure single feature class selectivities'''
import os
import argparse
import numpy as np
import torch
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from torchvision.utils import make_grid


def extract_map_layer_7x7(mobilenetV2_model):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_model = torch.nn.Sequential(*layer_list)
    return new_model

def extract_map_layer_14x14(mobilenetV2_model, layer):
    layer_list = list(mobilenetV2_model.module.features.children())
    new_layer_list = layer_list[:-layer]
    new_layer_list.append(layer_list[-layer].conv[0])
    new_model = torch.nn.Sequential(*new_layer_list)
    return new_model

def load_model(args):
    model = models.mobilenet_v2(pretrained=True)
    model.classifier = torch.nn.Linear(in_features=1280, out_features=args.n_out, bias=True)
    model = torch.nn.DataParallel(model).cuda()

    if args.model_path:
        if os.path.isfile(args.model_path):
            checkpoint = torch.load(args.model_path)
            model.load_state_dict(checkpoint['model_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.model_path))

    return model

def load_data(data_dir, args):

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    train_dataset = datasets.ImageFolder(
        data_dir,
        transforms.Compose([transforms.ToTensor(), normalize])
    )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=False,
        num_workers=args.workers, pin_memory=True, sampler=None
    )

    return train_loader

def predict(data_loader, model):

    targets = []
    preds = []

    # switch to evaluate mode
    model.eval()

    with torch.no_grad():
        for i, (images, target) in enumerate(data_loader):
            images = images.cuda()

            # compute predictions
            pred = model(images)
            pred = torch.mean(pred, dim=(2, 3))

            targets.append(target.cpu().numpy())
            preds.append(pred.cpu().numpy())

            print('Iter:', i)

    targets = np.concatenate(targets, axis=0)
    preds = np.concatenate(preds, axis=0)

    print('Targets size:', targets.shape)
    print('Preds size:', preds.shape)

    return targets, preds


if __name__ == '__main__':

    parser = argparse.ArgumentParser(description='Measure single feature class selectivities')
    parser.add_argument('data', metavar='DIR', help='path to dataset')
    parser.add_argument('--workers', default=4, type=int, help='number of data loading workers (default: 4)')
    parser.add_argument('--batch-size', default=580, type=int, help='mini-batch size, this is the total '
                                                                    'batch size of all GPUs on the current node when '
                                                                    'using Data Parallel or Distributed Data Parallel')
    parser.add_argument('--model-path', default='', type=str, help='path to model checkpoint '
                                                                   '(default: ImageNet-pretrained)')
    parser.add_argument('--n_out', default=1000, type=int, help='output dim of pre-trained model')
    parser.add_argument('--layer', default=1, type=int, choices=[1, 2, 6, 10, 14, 18], help='which layer?')

    args = parser.parse_args()

    model = load_model(args)
    if args.layer == 1:
        map_layer = extract_map_layer_7x7(model)
    else:
        map_layer = extract_map_layer_14x14(model, args.layer)

    data_loader = load_data(args.data, args)
    targets, preds = predict(data_loader, map_layer)

    n_classes = 26
    n_neurons = preds.shape[1]
    class_matrix_mean = np.zeros((n_neurons, n_classes))
    class_matrix_std = np.zeros((n_neurons, n_classes))

    for i in range(n_neurons):
        for j in range(n_classes):
            aux_vec = preds[targets==j, i]
            class_matrix_mean[i, j] = np.mean(aux_vec)
            class_matrix_std[i, j] = np.std(aux_vec)

    sorted_mean = np.sort(class_matrix_mean, axis=1)
    selectivity = (sorted_mean[:, -1] - np.mean(sorted_mean[:, :-1], axis=1)) / \
                  (sorted_mean[:, -1] + np.mean(sorted_mean[:, :-1], axis=1))

    print('Most selective 10 features:', np.argsort(selectivity)[-10:])
    print('Highest 10 selectivities:', np.sort(selectivity)[-10:])
    print('Selectivity shape:', selectivity.shape)

    np.save('selectivity_' + str(args.layer) + '.npy', selectivity)

================================================
FILE: temporal_classification.py
================================================
import argparse
import os
import random
import shutil
import time
import warnings

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.backends.cudnn as cudnn
import torch.distributed as dist
import torch.optim
import torch.multiprocessing as mp
import torch.utils.data
import torch.utils.data.distributed
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torchvision.models as models
from utils import GaussianBlur


parser = argparse.ArgumentParser(description='Temporal classification with headcam data')
parser.add_argument('data', metavar='DIR', help='path to dataset')
parser.add_argument('--model', default='resnet50', choices=['resnet50', 'resnext101_32x8d', 'resnext50_32x4d',
                                                            'mobilenet_v2'], help='model')
parser.add_argument('-j', '--workers', default=32, type=int, metavar='N', help='number of data loading workers (default'
                                                                               ':16)')
parser.add_argument('--epochs', default=16, type=int, metavar='N', help='number of total epochs to run')
parser.add_argument('--start-epoch', default=0, type=int, metavar='N', help='manual epoch number (useful on restarts)')
parser.add_argument('-b', '--batch-size', default=256, type=int, metavar='N',
                    help='mini-batch size (default: 128), this is the total batch size of all GPUs on the current node '
                         'when using Data Parallel or Distributed Data Parallel')
parser.add_argument('--lr', '--learning-rate', default=0.0005, type=float, metavar='LR', help='initial learning rate',
                    dest='lr')
parser.add_argument('--wd', '--weight-decay', default=0.0, type=float, metavar='W', help='weight decay (default: 0)',
                    dest='weight_decay')
parser.add_argument('-p', '--print-freq', default=10000, type=int, metavar='N', help='print frequency (default: 250)')
parser.add_argument('--resume', default='', type=str, metavar='PATH', help='path to latest checkpoint (default: none)')
parser.add_argument('--world-size', default=-1, type=int, help='number of nodes for distributed training')
parser.add_argument('--rank', default=-1, type=int, help='node rank for distributed training')
parser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str, help='url used to set up distributed '
                                                                                     'training')
parser.add_argument('--dist-backend', default='nccl', type=str, help='distributed backend')
parser.add_argument('--gpu', default=None, type=int, help='GPU id to use.')
parser.add_argument('--multiprocessing-distributed', action='store_true',
                    help='Use multi-processing distributed training to launch '
                         'N processes per node, which has N GPUs. This is the '
                         'fastest way to use PyTorch for either single node or '
                         'multi node data parallel training')
parser.add_argument('--n_out', default=1000, type=int, help='output dim')
parser.add_argument('--augmentation', default=True, action='store_false', help='whether to use data augmentation?')


def main():
    args = parser.parse_args()

    print(args)

    if args.gpu is not None:
        warnings.warn('You have chosen a specific GPU. This will completely disable data parallelism.')

    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])

    args.distributed = args.world_size > 1 or args.multiprocessing_distributed

    ngpus_per_node = torch.cuda.device_count()
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)


def main_worker(gpu, ngpus_per_node, args):
    args.gpu = gpu

    if args.gpu is not None:
        print("Use GPU: {} for training".format(args.gpu))

    print('Model:', args.model)
    model = models.__dict__[args.model](pretrained=False)
    if args.model.startswith('res'):
        model.fc = torch.nn.Linear(in_features=2048, out_features=args.n_out, bias=True)
    else:
        model.classifier = torch.nn.Linear(in_features=1280, out_features=args.n_out, bias=True)

    # DataParallel will divide and allocate batch_size to all available GPUs
    model = torch.nn.DataParallel(model).cuda()

    # define loss function (criterion) and optimizer
    criterion = nn.CrossEntropyLoss().cuda(args.gpu)
    optimizer = torch.optim.Adam(model.parameters(), args.lr, weight_decay=args.weight_decay)
    cudnn.benchmark = True

    if args.resume:
        if os.path.isfile(args.resume):
            print(args.resume)
            checkpoint = torch.load(args.resume)
            model.load_state_dict(checkpoint['model_state_dict'])
            optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
        else:
            print("=> no checkpoint found at '{}'".format(args.resume))

    savefile_name = args.model + '_augmentstrong_batch256_' + str(args.augmentation) + '_Y_5_288'

    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])

    if args.augmentation:
        train_dataset = datasets.ImageFolder(
            args.data,
            transforms.Compose([
                        transforms.RandomResizedCrop(224, scale=(0.2, 1.)),
                        transforms.RandomApply([transforms.ColorJitter(0.9, 0.9, 0.9, 0.5)], p=0.9),
                        transforms.RandomGrayscale(p=0.2),
                        transforms.RandomApply([GaussianBlur([.1, 2.])], p=0.5),
                        transforms.RandomHorizontalFlip(),
                        transforms.ToTensor(),
                        normalize
                        ])
        )
    else:
        train_dataset = datasets.ImageFolder(
            args.data,
            transforms.Compose([
                transforms.ToTensor(),
                normalize
            ])
        )

    train_loader = torch.utils.data.DataLoader(
        train_dataset, batch_size=args.batch_size, shuffle=True,
        num_workers=args.workers, pin_memory=True, sampler=None
    )

    acc1_list = []

    for epoch in range(args.start_epoch, args.epochs):

        # train for one epoch
        acc1 = train(train_loader, model, criterion, optimizer, epoch, args)
        acc1_list.append(acc1)

        torch.save({'acc1_list': acc1_list,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict()}, savefile_name + '_epoch_' + str(epoch) + '.tar')


def train(train_loader, model, criterion, optimizer, epoch, args):
    batch_time = AverageMeter('Time', ':6.3f')
    data_time = AverageMeter('Data', ':6.3f')
    losses = AverageMeter('Loss', ':.4e')
    top1 = AverageMeter('Acc@1', ':6.2f')
    top5 = AverageMeter('Acc@5', ':6.2f')
    progress = ProgressMeter(
        len(train_loader),
        [batch_time, data_time, losses, top1, top5],
        prefix="Epoch: [{}]".format(epoch))

    # switch to train mode
    model.train()

    end = time.time()
    for i, (images, target) in enumerate(train_loader):
        # measure data loading time
        data_time.update(time.time() - end)

        if args.gpu is not None:
            images = images.cuda(args.gpu, non_blocking=True)
        target = target.cuda(args.gpu, non_blocking=True)

        # compute output
        output = model(images)
        loss = criterion(output, target)

        # measure accuracy and record loss
        acc1, acc5 = accuracy(output, target, topk=(1, 5))
        losses.update(loss.item(), images.size(0))
        top1.update(acc1[0], images.size(0))
        top5.update(acc5[0], images.size(0))

        # compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # measure elapsed time
        batch_time.update(time.time() - end)
        end = time.time()

        if i % args.print_freq == 0:
            progress.display(i)

    return top1.avg.cpu().numpy()


class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self, name, fmt=':f'):
        self.name = name
        self.fmt = fmt
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count

    def __str__(self):
        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'
        return fmtstr.format(**self.__dict__)


class ProgressMeter(object):
    def __init__(self, num_batches, meters, prefix=""):
        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)
        self.meters = meters
        self.prefix = prefix

    def display(self, batch):
        entries = [self.prefix + self.batch_fmtstr.format(batch)]
        entries += [str(meter) for meter in self.meters]
        print('\t'.join(entries))

    def _get_batch_fmtstr(self, num_batches):
        num_digits = len(str(num_batches // 1))
        fmt = '{:' + str(num_digits) + 'd}'
        return '[' + fmt + '/' + fmt.format(num_batches) + ']'


def accuracy(output, target, topk=(1,)):
    """Computes the accuracy over the k top predictions for the specified values of k"""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        _, pred = output.topk(maxk, 1, True, True)
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


if __name__ == '__main__':
    main()

================================================
FILE: utils.py
================================================
# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
from PIL import ImageFilter
import random


class GaussianBlur(object):
    """Gaussian blur augmentation in SimCLR https://arxiv.org/abs/2002.05709"""

    def __init__(self, sigma=[.1, 2.]):
        self.sigma = sigma

    def __call__(self, x):
        sigma = random.uniform(self.sigma[0], self.sigma[1])
        x = x.filter(ImageFilter.GaussianBlur(radius=sigma))
        return x