Repository: irhumshafkat/R2Plus1D-PyTorch
Branch: master
Commit: 09b55c6f3c89
Files: 7
Total size: 26.2 KB

Directory structure:
gitextract_1opai4ny/
├── .gitignore
├── LICENSE
├── README.md
├── dataset.py
├── module.py
├── network.py
└── trainer.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Irhum Shafkat

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without
restriction, including without limitation the rights to use, copy, modify, merge,
publish, distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to the
following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

================================================
FILE: README.md
================================================
# R2Plus1D-PyTorch

PyTorch implementation of the R2Plus1D convolution-based ResNet architecture described in the paper "A Closer Look at Spatiotemporal Convolutions for Action Recognition"

Link to original: [paper](https://arxiv.org/abs/1711.11248) and [code](https://github.com/facebookresearch/R2Plus1D)

***NOTE: This repository has been archived, although forks and other work that extend on top of this remain welcome***

## Requirements

R2Plus1D-PyTorch has the following requirements:

* PyTorch 0.4 and its dependencies
* OpenCV (tested on 3.4.0.12)
* tqdm (for progress bars)

### About this repository

This repository consists of four Python files:

* `module.py` - Contains an implementation of the factored R2Plus1D convolution that the entire implementation is built around.
It is designed to be a drop-in replacement for nn.Conv3d in the appropriate scenario.
* `network.py` - Uses `module.py` to build up the residual network described in the paper
* `dataset.py` - Implements a PyTorch dataset that can load videos with appropriate labels from a given directory.
* `trainer.py` - A mildly modified version of the script from the PyTorch [tutorials](https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html) to train the model. Features saving and restoring capabilities.

### Training on Kinetics-400/600

This repository does not include a crawler or downloader for the Kinetics-400/600 dataset; however, one can be found [here](https://github.com/activitynet/ActivityNet/tree/master/Crawler/Kinetics). It is strongly recommended to downsample the videos prior to training (and not on the fly), using a tool such as ffmpeg. If using the crawler, this can be done by adding `"-vf", "scale=172:128"` to the ffmpeg command list in the download clip function.

### Training in general

This repository is designed so the ResNet can be trained on any dataset of videos in general, using the VideoDataset class from dataset.py. It expects the videos to be arranged as directory -> [train/val] folders -> [class_label] folders (one for each class) -> videos (the files themselves).

Forks and fixes of this repo are highly welcome!

================================================
FILE: dataset.py
================================================
import os
from pathlib import Path

import cv2
import numpy as np
from torch.utils.data import DataLoader, Dataset


class VideoDataset(Dataset):
    r"""A Dataset for a folder of videos. Expects the directory structure to be
    directory->[train/val/test]->[class labels]->[videos]. Initializes with a list
    of all file names, along with an array of labels, with label being automatically
    inferred from the respective folder names.
    Args:
        directory (str): The path to the directory containing the train/val/test datasets
        mode (str, optional): Determines which folder of the directory the dataset will read from. Defaults to 'train'.
        clip_len (int, optional): Determines how many frames there are in each clip. Defaults to 8.
    """

    def __init__(self, directory, mode='train', clip_len=8):
        folder = Path(directory)/mode  # get the directory of the specified split
        self.clip_len = clip_len

        # the following three parameters are chosen as described in the paper section 4.1
        self.resize_height = 128
        self.resize_width = 171
        self.crop_size = 112

        # obtain all the filenames of files inside all the class folders
        # going through each class folder one at a time
        self.fnames, labels = [], []
        for label in sorted(os.listdir(folder)):
            for fname in os.listdir(os.path.join(folder, label)):
                self.fnames.append(os.path.join(folder, label, fname))
                labels.append(label)

        # prepare a mapping between the label names (strings) and indices (ints)
        self.label2index = {label: index for index, label in enumerate(sorted(set(labels)))}
        # convert the list of label names into an array of label indices
        self.label_array = np.array([self.label2index[label] for label in labels], dtype=int)

    def __getitem__(self, index):
        # loading and preprocessing. TODO move them to transform classes
        buffer = self.loadvideo(self.fnames[index])
        buffer = self.crop(buffer, self.clip_len, self.crop_size)
        buffer = self.normalize(buffer)

        return buffer, self.label_array[index]

    def loadvideo(self, fname):
        # initialize a VideoCapture object to read video data into a numpy array
        capture = cv2.VideoCapture(fname)
        frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
        frame_width = int(capture.get(cv2.CAP_PROP_FRAME_WIDTH))
        frame_height = int(capture.get(cv2.CAP_PROP_FRAME_HEIGHT))

        # create a buffer. Must have dtype float, so it gets converted to a FloatTensor by PyTorch later
        buffer = np.empty((frame_count, self.resize_height, self.resize_width, 3), np.dtype('float32'))

        count = 0
        retaining = True

        # read in each frame, one at a time into the numpy buffer array
        while (count < frame_count and retaining):
            retaining, frame = capture.read()
            # guard against decode failures: capture.read() returns (False, None)
            # when a frame cannot be read, and cv2.cvtColor would raise on None
            if not retaining or frame is None:
                break
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            # will resize frames if not already final size
            # NOTE: strongly recommended to resize them during the download process. This script
            # will process videos of any size, but will take longer the larger the video file.
            if (frame_height != self.resize_height) or (frame_width != self.resize_width):
                frame = cv2.resize(frame, (self.resize_width, self.resize_height))
            buffer[count] = frame
            count += 1

        # release the VideoCapture once it is no longer needed
        capture.release()

        # convert from [D, H, W, C] format to [C, D, H, W] (what PyTorch uses)
        # D = Depth (in this case, time), H = Height, W = Width, C = Channels
        buffer = buffer.transpose((3, 0, 1, 2))

        return buffer

    def crop(self, buffer, clip_len, crop_size):
        # randomly select time index for temporal jittering
        # (max(..., 1) guards against np.random.randint(0) raising when the
        # video has exactly clip_len frames)
        time_index = np.random.randint(max(buffer.shape[1] - clip_len, 1))
        # randomly select start indices in order to crop the video
        height_index = np.random.randint(buffer.shape[2] - crop_size)
        width_index = np.random.randint(buffer.shape[3] - crop_size)

        # crop and jitter the video using indexing. The spatial crop is performed on
        # the entire array, so each frame is cropped in the same location. The temporal
        # jitter takes place via the selection of consecutive frames
        buffer = buffer[:, time_index:time_index + clip_len,
                        height_index:height_index + crop_size,
                        width_index:width_index + crop_size]

        return buffer

    def normalize(self, buffer):
        # Normalize the buffer
        # NOTE: Default values for RGB image normalization are used, as precomputed
        # mean and std_dev values (akin to ImageNet) were unavailable for Kinetics.
        # Feel free to push to and edit this section to replace them if found.
        buffer = (buffer - 128)/128
        return buffer

    def __len__(self):
        return len(self.fnames)


class VideoDataset1M(VideoDataset):
    r"""Dataset that implements VideoDataset, and produces exactly 1M augmented
    training samples every epoch.

    Args:
        directory (str): The path to the directory containing the train/val/test datasets
        mode (str, optional): Determines which folder of the directory the dataset will read from. Defaults to 'train'.
        clip_len (int, optional): Determines how many frames there are in each clip. Defaults to 8.
    """

    def __init__(self, directory, mode='train', clip_len=8):
        # Initialize instance of original dataset class
        super(VideoDataset1M, self).__init__(directory, mode, clip_len)

    def __getitem__(self, index):
        # if we are to have 1M samples on every pass, we need to shuffle
        # the index to a number in the original range, or else we'll get an
        # index error. This is a legitimate operation, as even with the same
        # index being used multiple times, it'll be randomly cropped, and
        # be temporally jittered differently on each pass, properly
        # augmenting the data.
        index = np.random.randint(len(self.fnames))

        buffer = self.loadvideo(self.fnames[index])
        buffer = self.crop(buffer, self.clip_len, self.crop_size)
        buffer = self.normalize(buffer)

        return buffer, self.label_array[index]

    def __len__(self):
        return 1000000  # manually set the length to 1 million


================================================
FILE: module.py
================================================
import math

import torch.nn as nn
from torch.nn.modules.utils import _triple


class SpatioTemporalConv(nn.Module):
    r"""Applies a factored 3D convolution over an input signal composed of several input
    planes with distinct spatial and time axes, by performing a 2D convolution over the
    spatial axes to an intermediate subspace, followed by a 1D convolution over the time
    axis to produce the final output.
    Args:
        in_channels (int): Number of channels in the input tensor
        out_channels (int): Number of channels produced by the convolution
        kernel_size (int or tuple): Size of the convolving kernel
        stride (int or tuple, optional): Stride of the convolution. Default: 1
        padding (int or tuple, optional): Zero-padding added to the sides of the input during their respective convolutions. Default: 0
        bias (bool, optional): If ``True``, adds a learnable bias to the output. Default: ``True``
    """

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, bias=True):
        super(SpatioTemporalConv, self).__init__()

        # if ints are entered, convert them to iterables, 1 -> [1, 1, 1]
        kernel_size = _triple(kernel_size)
        stride = _triple(stride)
        padding = _triple(padding)

        # decomposing the parameters into spatial and temporal components by
        # masking out the values with the defaults on the axis that
        # won't be convolved over. This is necessary to avoid unintentional
        # behavior such as padding being added twice
        spatial_kernel_size = [1, kernel_size[1], kernel_size[2]]
        spatial_stride = [1, stride[1], stride[2]]
        spatial_padding = [0, padding[1], padding[2]]

        temporal_kernel_size = [kernel_size[0], 1, 1]
        temporal_stride = [stride[0], 1, 1]
        temporal_padding = [padding[0], 0, 0]

        # compute the number of intermediary channels (M) using the formula
        # from the paper section 3.5
        intermed_channels = int(math.floor((kernel_size[0] * kernel_size[1] * kernel_size[2] * in_channels * out_channels) /
                                           (kernel_size[1] * kernel_size[2] * in_channels + kernel_size[0] * out_channels)))

        # the spatial conv is effectively a 2D conv due to the
        # spatial_kernel_size, followed by batch_norm and ReLU
        self.spatial_conv = nn.Conv3d(in_channels, intermed_channels, spatial_kernel_size,
                                      stride=spatial_stride, padding=spatial_padding, bias=bias)
        self.bn = nn.BatchNorm3d(intermed_channels)
        self.relu = nn.ReLU()

        # the temporal conv is effectively a 1D conv, but has batch norm
        # and ReLU added inside the model
        # constructor, not here. This is an
        # intentional design choice, to allow this module to externally act
        # identical to a standard Conv3d, so it can be reused easily in any
        # other codebase
        self.temporal_conv = nn.Conv3d(intermed_channels, out_channels, temporal_kernel_size,
                                       stride=temporal_stride, padding=temporal_padding, bias=bias)

    def forward(self, x):
        x = self.relu(self.bn(self.spatial_conv(x)))
        x = self.temporal_conv(x)
        return x


================================================
FILE: network.py
================================================
import torch.nn as nn
from torch.nn.modules.utils import _triple

from module import SpatioTemporalConv


class SpatioTemporalResBlock(nn.Module):
    r"""Single block for the ResNet network. Uses SpatioTemporalConv in
    the standard ResNet block layout (conv->batchnorm->ReLU->conv->batchnorm->sum->ReLU)

    Args:
        in_channels (int): Number of channels in the input tensor.
        out_channels (int): Number of channels in the output produced by the block.
        kernel_size (int or tuple): Size of the convolving kernels.
        downsample (bool, optional): If ``True``, the output size is to be smaller than the input. Default: ``False``
    """

    def __init__(self, in_channels, out_channels, kernel_size, downsample=False):
        super(SpatioTemporalResBlock, self).__init__()

        # If downsample == True, the first conv of the layer has stride = 2
        # to halve the residual output size, and the input x is passed
        # through a separate 1x1x1 conv with stride = 2 to also halve it.
        # no pooling layers are used inside ResNet
        self.downsample = downsample

        # to allow for SAME padding
        padding = kernel_size//2

        if self.downsample:
            # downsample with stride = 2 the input x
            self.downsampleconv = SpatioTemporalConv(in_channels, out_channels, 1, stride=2)
            self.downsamplebn = nn.BatchNorm3d(out_channels)

            # downsample with stride = 2 when producing the residual
            self.conv1 = SpatioTemporalConv(in_channels, out_channels, kernel_size, padding=padding, stride=2)
        else:
            self.conv1 = SpatioTemporalConv(in_channels, out_channels, kernel_size, padding=padding)

        self.bn1 = nn.BatchNorm3d(out_channels)
        self.relu1 = nn.ReLU()

        # standard conv->batchnorm->ReLU
        self.conv2 = SpatioTemporalConv(out_channels, out_channels, kernel_size, padding=padding)
        self.bn2 = nn.BatchNorm3d(out_channels)

        self.outrelu = nn.ReLU()

    def forward(self, x):
        res = self.relu1(self.bn1(self.conv1(x)))
        res = self.bn2(self.conv2(res))

        if self.downsample:
            x = self.downsamplebn(self.downsampleconv(x))

        return self.outrelu(x + res)


class SpatioTemporalResLayer(nn.Module):
    r"""Forms a single layer of the ResNet network, with a number of repeating
    blocks of the same output size stacked on top of each other

    Args:
        in_channels (int): Number of channels in the input tensor.
        out_channels (int): Number of channels in the output produced by the layer.
        kernel_size (int or tuple): Size of the convolving kernels.
        layer_size (int): Number of blocks to be stacked to form the layer
        block_type (Module, optional): Type of block that is to be used to form the layer. Default: SpatioTemporalResBlock.
        downsample (bool, optional): If ``True``, the first block in layer will implement downsampling.
        Default: ``False``
    """

    def __init__(self, in_channels, out_channels, kernel_size, layer_size,
                 block_type=SpatioTemporalResBlock, downsample=False):
        super(SpatioTemporalResLayer, self).__init__()

        # implement the first block
        self.block1 = block_type(in_channels, out_channels, kernel_size, downsample)

        # prepare module list to hold all (layer_size - 1) blocks
        self.blocks = nn.ModuleList([])
        for i in range(layer_size - 1):
            # all these blocks are identical, and have downsample = False by default
            self.blocks += [block_type(out_channels, out_channels, kernel_size)]

    def forward(self, x):
        x = self.block1(x)
        for block in self.blocks:
            x = block(x)

        return x


class R2Plus1DNet(nn.Module):
    r"""Forms the overall ResNet feature extractor by initializing 5 layers, with the number
    of blocks in each layer set by layer_sizes, and by performing a global average pool
    at the end producing a 512-dimensional vector for each element in the batch.

    Args:
        layer_sizes (tuple): An iterable containing the number of blocks in each layer
        block_type (Module, optional): Type of block that is to be used to form the layers. Default: SpatioTemporalResBlock.
    """

    def __init__(self, layer_sizes, block_type=SpatioTemporalResBlock):
        super(R2Plus1DNet, self).__init__()

        # first conv, with stride 1x2x2 and kernel size 3x7x7
        self.conv1 = SpatioTemporalConv(3, 64, [3, 7, 7], stride=[1, 2, 2], padding=[1, 3, 3])
        # output of conv2 is same size as of conv1, no downsampling needed.
        # kernel_size 3x3x3
        self.conv2 = SpatioTemporalResLayer(64, 64, 3, layer_sizes[0], block_type=block_type)

        # each of the final three layers doubles num_channels, while performing downsampling
        # inside the first block
        self.conv3 = SpatioTemporalResLayer(64, 128, 3, layer_sizes[1], block_type=block_type, downsample=True)
        self.conv4 = SpatioTemporalResLayer(128, 256, 3, layer_sizes[2], block_type=block_type, downsample=True)
        self.conv5 = SpatioTemporalResLayer(256, 512, 3, layer_sizes[3], block_type=block_type, downsample=True)

        # global average pooling of the output
        self.pool = nn.AdaptiveAvgPool3d(1)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)

        x = self.pool(x)

        return x.view(-1, 512)


class R2Plus1DClassifier(nn.Module):
    r"""Forms a complete ResNet classifier producing vectors of size num_classes, by initializing
    5 layers, with the number of blocks in each layer set by layer_sizes, and by performing
    a global average pool at the end producing a 512-dimensional vector for each element
    in the batch, and passing them through a Linear layer.

    Args:
        num_classes (int): Number of classes in the data
        layer_sizes (tuple): An iterable containing the number of blocks in each layer
        block_type (Module, optional): Type of block that is to be used to form the layers. Default: SpatioTemporalResBlock.
""" def __init__(self, num_classes, layer_sizes, block_type=SpatioTemporalResBlock): super(R2Plus1DClassifier, self).__init__() self.res2plus1d = R2Plus1DNet(layer_sizes, block_type) self.linear = nn.Linear(512, num_classes) def forward(self, x): x = self.res2plus1d(x) x = self.linear(x) return x ================================================ FILE: trainer.py ================================================ import os import time import numpy as np import torch from torch import nn, optim from torch.utils.data import DataLoader from tqdm import tqdm from dataset import VideoDataset, VideoDataset1M from network import R2Plus1DClassifier # Use GPU if available else revert to CPU device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu") print("Device being used:", device) def train_model(num_classes, directory, layer_sizes=[2, 2, 2, 2], num_epochs=45, save=True, path="model_data.pth.tar"): """Initalizes and the model for a fixed number of epochs, using dataloaders from the specified directory, selected optimizer, scheduler, criterion, defualt otherwise. Features saving and restoration capabilities as well. Adapted from the PyTorch tutorial found here: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html Args: num_classes (int): Number of classes in the data directory (str): Directory where the data is to be loaded from layer_sizes (list, optional): Number of blocks in each layer. Defaults to [2, 2, 2, 2], equivalent to ResNet18. num_epochs (int, optional): Number of epochs to train for. Defaults to 45. save (bool, optional): If true, the model will be saved to path. Defaults to True. path (str, optional): The directory to load a model checkpoint from, and if save == True, save to. Defaults to "model_data.pth.tar". 
""" # initalize the ResNet 18 version of this model model = R2Plus1DClassifier(num_classes=num_classes, layer_sizes=layer_sizes).to(device) criterion = nn.CrossEntropyLoss() # standard crossentropy loss for classification optimizer = optim.SGD(model.parameters(), lr=0.01) # hyperparameters as given in paper sec 4.1 scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1) # the scheduler divides the lr by 10 every 10 epochs # prepare the dataloaders into a dict train_dataloader = DataLoader(VideoDataset(directory), batch_size=10, shuffle=True, num_workers=4) # IF training on Kinetics-600 and require exactly a million samples each epoch, # import VideoDataset1M and uncomment the following # train_dataloader = DataLoader(VideoDataset1M(directory), batch_size=32, num_workers=4) val_dataloader = DataLoader(VideoDataset(directory, mode='val'), batch_size=14, num_workers=4) dataloaders = {'train': train_dataloader, 'val': val_dataloader} dataset_sizes = {x: len(dataloaders[x].dataset) for x in ['train', 'val']} # saves the time the process was started, to compute total time at the end start = time.time() epoch_resume = 0 # check if there was a previously saved checkpoint if os.path.exists(path): # loads the checkpoint checkpoint = torch.load(path) print("Reloading from previously saved checkpoint") # restores the model and optimizer state_dicts model.load_state_dict(checkpoint['state_dict']) optimizer.load_state_dict(checkpoint['opt_dict']) # obtains the epoch the training is to resume from epoch_resume = checkpoint["epoch"] for epoch in tqdm(range(epoch_resume, num_epochs), unit="epochs", initial=epoch_resume, total=num_epochs): # each epoch has a training and validation step, in that order for phase in ['train', 'val']: # reset the running loss and corrects running_loss = 0.0 running_corrects = 0 # set model to train() or eval() mode depending on whether it is trained # or being validated. Primarily affects layers such as BatchNorm or Dropout. 
            if phase == 'train':
                # scheduler.step() is to be called once every epoch during training
                scheduler.step()
                model.train()
            else:
                model.eval()

            for inputs, labels in dataloaders[phase]:
                # move inputs and labels to the device the training is taking place on
                inputs = inputs.to(device)
                labels = labels.to(device)
                optimizer.zero_grad()

                # keep intermediate states iff backpropagation will be performed. If false,
                # then all intermediate states will be thrown away during evaluation, to use
                # the least amount of memory possible.
                with torch.set_grad_enabled(phase == 'train'):
                    outputs = model(inputs)
                    # we're interested in the indices of the max values, not the values themselves
                    _, preds = torch.max(outputs, 1)
                    loss = criterion(outputs, labels)

                    # Backpropagate and optimize iff in training mode, else there are no intermediate
                    # values to backpropagate with, and it will throw an error.
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)

            epoch_loss = running_loss / dataset_sizes[phase]
            epoch_acc = running_corrects.double() / dataset_sizes[phase]

            print(f"{phase} Loss: {epoch_loss} Acc: {epoch_acc}")

        # save the model if save=True
        if save:
            torch.save({
                'epoch': epoch + 1,
                'state_dict': model.state_dict(),
                'acc': epoch_acc,
                'opt_dict': optimizer.state_dict(),
            }, path)

    # print the total time needed, HH:MM:SS format (cast to int so the
    # f-string does not print fractional hours/minutes)
    time_elapsed = time.time() - start
    print(f"Training complete in {int(time_elapsed//3600)}h {int((time_elapsed % 3600)//60)}m {int(time_elapsed % 60)}s")
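The intermediate channel count M that `module.py` computes inside `SpatioTemporalConv` can be checked in isolation. Below is a minimal, dependency-free sketch of the paper's Section 3.5 formula as used in `__init__`; the helper name `intermed_channels` is illustrative and not part of the repo:

```python
import math

def intermed_channels(in_channels, out_channels, kernel_size):
    """Number of intermediary channels M for the (2+1)D factorization,
    mirroring the expression in module.py (paper Section 3.5):
    M = floor(t*h*w*Nin*Nout / (h*w*Nin + t*Nout))"""
    t, h, w = kernel_size  # temporal, height, width kernel extents
    return int(math.floor(
        (t * h * w * in_channels * out_channels)
        / (h * w * in_channels + t * out_channels)
    ))

# a 64 -> 64 block with a 3x3x3 kernel, as used throughout the ResNet layers
print(intermed_channels(64, 64, (3, 3, 3)))  # -> 144
# the network's first conv: 3 -> 64 channels with a 3x7x7 kernel
print(intermed_channels(3, 64, (3, 7, 7)))   # -> 83
```

The formula is chosen so the factored spatial + temporal convolutions have roughly the same parameter count as the full 3D convolution they replace.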