Repository: MaigoAkisame/cmu-thesis Branch: master Commit: 7578dda2de95 Files: 26 Total size: 93.5 KB Directory structure: gitextract_nbiev_ku/ ├── .gitignore ├── LICENSE ├── README.md ├── code/ │ ├── audioset/ │ │ ├── Net.py │ │ ├── eval-TALNet.sh │ │ ├── eval.py │ │ ├── train.py │ │ ├── util_f1.py │ │ ├── util_in.py │ │ └── util_out.py │ ├── dcase/ │ │ ├── Net.py │ │ ├── eval.py │ │ ├── train.py │ │ ├── util_f1.py │ │ ├── util_in.py │ │ └── util_out.py │ └── sequential/ │ ├── Net.py │ ├── ctc.py │ ├── ctl.py │ ├── eval.py │ ├── train.py │ ├── util_f1.py │ ├── util_in.py │ └── util_out.py ├── data/ │ └── download.sh └── workspace/ └── .gitignore ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ *.pyc data/dcase data/audioset data/sequential workspace/dcase workspace/audioset workspace/sequential ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2018 Yun Wang Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # cmu-thesis This repository contains the code for three experiments in my PhD thesis, [Polyphonic Sound Event Detection with Weak Labeling](http://www.cs.cmu.edu/~yunwang/papers/cmu-thesis.pdf): * Sound event detection with **presence/absence labeling** on the **[DCASE 2017 challenge](http://www.cs.tut.fi/sgn/arg/dcase2017/challenge/task-large-scale-sound-event-detection)** (Chapter 3.2) * Sound event detection with **presence/absence labeling** on **[Google Audio Set](https://research.google.com/audioset/)** (Chapter 3.3) * Sound event detection with **sequential labeling** on a subset of **[Google Audio Set](https://research.google.com/audioset/)** (Chapter 4) ## Prerequisites Hardware: * A GPU * Large storage (1 TB recommended) Software: * Python 2.7 * PyTorch (I used version 0.4.0a0+d3b6c5e) * numpy, scipy, [joblib](https://pypi.org/project/joblib/) ## Quick Start ```sh # Clone the repository git clone https://github.com/MaigoAkisame/cmu-thesis.git # Download the data: may take up to 1 day! cd cmu-thesis/data ./download.sh # Train a model for the DCASE experiment using default settings cd ../code/dcase python train.py # Needs to run on a GPU # Evaluate the model at Checkpoint 25 python eval.py --ckpt=25 # Needs to run on a GPU for the first time # Download and evaluate the TALNet model for the Audio Set experiment cd ../audioset ./eval-TALNet.sh # Needs to run on a GPU for the first time ``` ## Organization of the Repository ### code The `code` directory contains three sub-directories: `dcase`, `audioset`, and `sequential`. These contain the code for the three experiments.
In each subdirectory: * `Net.py` defines the network architecture (you don't need to execute this script directly); * `train.py` trains the network; * `eval.py` evaluates the network's performance. The `train.py` and `eval.py` scripts can take many command line arguments, which specify the architecture of the network and the hyperparameters used during training. If you encounter "out of memory" errors, try reducing the batch size. Some scripts that may be of special interest: * `code/*/util_in.py`: Implements data balancing so that each minibatch contains roughly equal numbers of recordings of each event type; * `code/sequential/ctc.py`: My implementation of connectionist temporal classification (CTC); * `code/sequential/ctl.py`: My implementation of connectionist temporal localization (CTL). ### data The script `data/download.sh` will download and extract the following three archives in the `data` directory: * [dcase.tgz](http://islpc21.is.cs.cmu.edu/yunwang/git/cmu-thesis/data/dcase.tgz) (4.9 GB) * [audioset.tgz](http://islpc21.is.cs.cmu.edu/yunwang/git/cmu-thesis/data/audioset.tgz) (341 GB) * [sequential.tgz](http://islpc21.is.cs.cmu.edu/yunwang/git/cmu-thesis/data/sequential.tgz) (63 GB) These archives contain Matlab data files (with the `.mat` extension) that store the filterbank features and ground truth labels. They can be loaded with the `scipy.io.loadmat` function in Python. Each Matlab file contains three matrices: * `feat`: Filterbank features, a float32 array of shape (n, 400, 64) (n recordings, 400 frames, 64 frequency bins); * `labels`: * Presence/absence labeling, a boolean array of shape (n, m) (n recordings, m event types), or * Strong labeling, a boolean array of shape (n, 100, m) (n recordings, 100 frames, m event types); * `hashes`: A character array of size (n, 11), containing the YouTube hash IDs of the recordings.
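A file in this layout can be read back with `scipy.io.loadmat`. Below is a minimal sketch using a tiny synthetic stand-in written with `savemat` (the real files live under `data/dcase` etc. and additionally contain the `hashes` character array; the path `example.mat` and the sizes are placeholders):

```python
# Minimal sketch: round-trip a .mat file in the documented layout.
import numpy
from scipy.io import loadmat, savemat

fake = {
    'feat': numpy.zeros((2, 400, 64), dtype='float32'),  # 2 recordings, 400 frames, 64 bins
    'labels': numpy.zeros((2, 17), dtype='bool'),        # presence/absence for 17 event types
}
savemat('example.mat', fake)

data = loadmat('example.mat')
feat, labels = data['feat'], data['labels']
print(feat.shape, labels.shape)  # (2, 400, 64) (2, 17)
```

The same call works unchanged on the downloaded archives; only the first dimension (number of recordings) varies per file.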
Training recordings are organized by class (so data balancing can be done easily), and each Matlab file contains up to 101 recordings. Validation and test/evaluation recordings are stored in Matlab files that contain up to 500 recordings each. Because the data is so huge, I do not provide the code for downloading the raw data, extracting features, and organizing the features and labels into Matlab data files. The whole process took me more than a month and endless babysitting! ### workspace The training logs, trained models, predictions on the test/evaluation recordings, and evaluation results will be generated in this directory. The sub-directory names will reflect the network architecture and hyperparameters for training. The script `code/audioset/eval-TALNet.sh` will download the TALNet model and store it at `workspace/audioset/TALNet/model/TALNet.pt`. At the time of my graduation (October 2018), this was the best model that could both classify and localize sound events on Google Audio Set. ## Citing If you use this code in your research, please cite my PhD thesis: * Yun Wang, "Polyphonic sound event detection with weak labeling", PhD thesis, Carnegie Mellon University, Oct. 2018. and/or the following publications: * Yun Wang, Juncheng Li and Florian Metze, "A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling," arXiv e-prints, Oct. 2018. [Online]. Available: . * Yun Wang and Florian Metze, "Connectionist temporal localization for sound event detection with sequential labeling," arXiv e-prints, Oct. 2018. [Online]. Available: .
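The five pooling modes selected by the `--pooling` option (`max`, `ave`, `lin`, `exp`, `att`) appear throughout the code. The four that need no learned parameters can be sketched in plain numpy, mirroring what `dcase_sed_eval` in `code/*/util_out.py` does at segment level (the `att` mode additionally needs the learned `fc_att` layer in `Net.py`, so it is omitted here):

```python
import numpy

def pool(frame_prob, mode):
    # frame_prob: (time, classes) array of frame-level probabilities
    if mode == 'max':
        return frame_prob.max(axis=0)
    if mode == 'ave':
        return frame_prob.mean(axis=0)
    if mode == 'lin':   # linear softmax: each frame weighted by its own probability
        return (frame_prob * frame_prob).sum(axis=0) / frame_prob.sum(axis=0)
    if mode == 'exp':   # exponential softmax
        weight = numpy.exp(frame_prob)
        return (frame_prob * weight).sum(axis=0) / weight.sum(axis=0)
    raise ValueError('attention pooling needs the learned fc_att weights')

p = numpy.array([[0.1], [0.9]])   # 2 frames, 1 class
print(pool(p, 'ave'))             # -> [0.5]
print(pool(p, 'lin'))             # (0.01 + 0.81) / 1.0 -> [0.82]
```

Note how `lin` and `exp` sit between `ave` and `max`: frames with higher probability get larger weights, which is what makes them useful for localization.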
================================================ FILE: code/audioset/Net.py ================================================ import torch import torch.nn as nn import torch.nn.functional as F from torch.autograd import Variable import numpy class ConvBlock(nn.Module): def __init__(self, n_input_feature_maps, n_output_feature_maps, kernel_size, batch_norm = False, pool_stride = None): super(ConvBlock, self).__init__() assert all(x % 2 == 1 for x in kernel_size) self.n_input = n_input_feature_maps self.n_output = n_output_feature_maps self.kernel_size = kernel_size self.batch_norm = batch_norm self.pool_stride = pool_stride # "~batch_norm" should be written as "not batch_norm"; otherwise ~True will evaluate to -2 and be treated as True. # But I'll keep this error to avoid breaking existing models. self.conv = nn.Conv2d(self.n_input, self.n_output, self.kernel_size, padding = tuple(x/2 for x in self.kernel_size), bias = ~batch_norm) if batch_norm: self.bn = nn.BatchNorm2d(self.n_output) nn.init.xavier_uniform(self.conv.weight) def forward(self, x): x = self.conv(x) if self.batch_norm: x = self.bn(x) x = F.relu(x) if self.pool_stride is not None: x = F.max_pool2d(x, self.pool_stride) return x class Net(nn.Module): def __init__(self, args): super(Net, self).__init__() self.__dict__.update(args.__dict__) # Instill all args into self assert self.n_conv_layers % self.n_pool_layers == 0 self.input_n_freq_bins = n_freq_bins = 64 self.output_size = 527 self.conv = [] pool_interval = self.n_conv_layers / self.n_pool_layers n_input = 1 for i in range(self.n_conv_layers): if (i + 1) % pool_interval == 0: # this layer has pooling n_freq_bins /= 2 n_output = self.embedding_size / n_freq_bins pool_stride = (2, 2) if i < pool_interval * 2 else (1, 2) else: n_output = self.embedding_size * 2 / n_freq_bins pool_stride = None layer = ConvBlock(n_input, n_output, self.kernel_size, batch_norm = self.batch_norm, pool_stride = pool_stride) self.conv.append(layer) self.__setattr__('conv' + 
str(i + 1), layer) n_input = n_output self.gru = nn.GRU(self.embedding_size, self.embedding_size / 2, 1, batch_first = True, bidirectional = True) self.fc_prob = nn.Linear(self.embedding_size, self.output_size) if self.pooling == 'att': self.fc_att = nn.Linear(self.embedding_size, self.output_size) # Better initialization nn.init.orthogonal(self.gru.weight_ih_l0); nn.init.constant(self.gru.bias_ih_l0, 0) nn.init.orthogonal(self.gru.weight_hh_l0); nn.init.constant(self.gru.bias_hh_l0, 0) nn.init.orthogonal(self.gru.weight_ih_l0_reverse); nn.init.constant(self.gru.bias_ih_l0_reverse, 0) nn.init.orthogonal(self.gru.weight_hh_l0_reverse); nn.init.constant(self.gru.bias_hh_l0_reverse, 0) nn.init.xavier_uniform(self.fc_prob.weight); nn.init.constant(self.fc_prob.bias, 0) if self.pooling == 'att': nn.init.xavier_uniform(self.fc_att.weight); nn.init.constant(self.fc_att.bias, 0) def forward(self, x): x = x.view((-1, 1, x.size(1), x.size(2))) # x becomes (batch, channel, time, freq) for i in range(len(self.conv)): if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) x = self.conv[i](x) # x becomes (batch, channel, time, freq) x = x.permute(0, 2, 1, 3).contiguous() # x becomes (batch, time, channel, freq) x = x.view((-1, x.size(1), x.size(2) * x.size(3))) # x becomes (batch, time, embedding_size) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) x, _ = self.gru(x) # x becomes (batch, time, embedding_size) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) frame_prob = F.sigmoid(self.fc_prob(x)) # shape of frame_prob: (batch, time, output_size) frame_prob = torch.clamp(frame_prob, 1e-7, 1 - 1e-7) if self.pooling == 'max': global_prob, _ = frame_prob.max(dim = 1) return global_prob, frame_prob elif self.pooling == 'ave': global_prob = frame_prob.mean(dim = 1) return global_prob, frame_prob elif self.pooling == 'lin': global_prob = (frame_prob * frame_prob).sum(dim = 1) / 
frame_prob.sum(dim = 1) return global_prob, frame_prob elif self.pooling == 'exp': global_prob = (frame_prob * frame_prob.exp()).sum(dim = 1) / frame_prob.exp().sum(dim = 1) return global_prob, frame_prob elif self.pooling == 'att': frame_att = F.softmax(self.fc_att(x), dim = 1) global_prob = (frame_prob * frame_att).sum(dim = 1) return global_prob, frame_prob, frame_att def predict(self, x, verbose = True, batch_size = 100): # Predict in batches. Both input and output are numpy arrays. # If verbose == True, return all of global_prob, frame_prob and att # If verbose == False, only return global_prob result = [] for i in range(0, len(x), batch_size): with torch.no_grad(): input = Variable(torch.from_numpy(x[i : i + batch_size])).cuda() output = self.forward(input) if not verbose: output = output[:1] result.append([var.data.cpu().numpy() for var in output]) result = tuple(numpy.concatenate(items) for items in zip(*result)) return result if verbose else result[0] ================================================ FILE: code/audioset/eval-TALNet.sh ================================================ TALNet_FILE=../../workspace/audioset/TALNet/model/TALNet.pt if ! 
[ -f $TALNet_FILE ]; then mkdir -p $(dirname $TALNet_FILE) wget -O $TALNet_FILE http://islpc21.is.cs.cmu.edu/yunwang/git/cmu-thesis/model/TALNet.pt fi python eval.py --TALNet ================================================ FILE: code/audioset/eval.py ================================================ import sys, os, os.path import argparse import numpy from util_out import * from util_f1 import * from scipy.io import loadmat, savemat # Parse input arguments def mybool(s): return s.lower() in ['t', 'true', 'y', 'yes', '1'] parser = argparse.ArgumentParser() parser.add_argument('--TALNet', action = 'store_true') # specify this to evaluate the pre-trained TALNet model parser.add_argument('--embedding_size', type = int, default = 1024) # this is the embedding size after a pooling layer # after a non-pooling layer, the embedding size will be twice this much parser.add_argument('--n_conv_layers', type = int, default = 10) parser.add_argument('--kernel_size', type = str, default = '3') # 'n' or 'nxm' parser.add_argument('--n_pool_layers', type = int, default = 5) # the pooling layers will be inserted uniformly into the conv layers # there should be at least 2 and at most 6 pooling layers # the first two pooling layers will have stride (2,2); later ones will have stride (1,2) parser.add_argument('--batch_norm', type = mybool, default = True) parser.add_argument('--dropout', type = float, default = 0.0) parser.add_argument('--pooling', type = str, default = 'lin', choices = ['max', 'ave', 'lin', 'exp', 'att']) parser.add_argument('--batch_size', type = int, default = 250) parser.add_argument('--ckpt_size', type = int, default = 1000) # how many batches per checkpoint parser.add_argument('--optimizer', type = str, default = 'adam', choices = ['adam', 'sgd']) parser.add_argument('--init_lr', type = float, default = 1e-3) parser.add_argument('--lr_patience', type = int, default = 3) parser.add_argument('--lr_factor', type = float, default = 0.8)
parser.add_argument('--random_seed', type = int, default = 15213) parser.add_argument('--ckpt', type = int) args = parser.parse_args() if 'x' not in args.kernel_size: args.kernel_size = args.kernel_size + 'x' + args.kernel_size # Locate model file and prepare directories for prediction and evaluation expid = 'TALNet' if args.TALNet else 'embed%d-%dC%dP-kernel%s-%s-drop%.1f-%s-batch%d-ckpt%d-%s-lr%.0e-pat%d-fac%.1f-seed%d' % ( args.embedding_size, args.n_conv_layers, args.n_pool_layers, args.kernel_size, 'bn' if args.batch_norm else 'nobn', args.dropout, args.pooling, args.batch_size, args.ckpt_size, args.optimizer, args.init_lr, args.lr_patience, args.lr_factor, args.random_seed ) WORKSPACE = os.path.join('../../workspace/audioset', expid) PRED_PATH = os.path.join(WORKSPACE, 'pred') if not os.path.exists(PRED_PATH): os.makedirs(PRED_PATH) EVAL_PATH = os.path.join(WORKSPACE, 'eval') if not os.path.exists(EVAL_PATH): os.makedirs(EVAL_PATH) if args.TALNet: MODEL_FILE = os.path.join(WORKSPACE, 'model', 'TALNet.pt') PRED_FILE = os.path.join(PRED_PATH, 'TALNet.mat') EVAL_FILE = os.path.join(EVAL_PATH, 'TALNet.txt') else: MODEL_FILE = os.path.join(WORKSPACE, 'model', 'checkpoint%d.pt' % args.ckpt) PRED_FILE = os.path.join(PRED_PATH, 'checkpoint%d.mat' % args.ckpt) EVAL_FILE = os.path.join(EVAL_PATH, 'checkpoint%d.txt' % args.ckpt) with open(EVAL_FILE, 'w'): pass def write_log(s): print s with open(EVAL_FILE, 'a') as f: f.write(s + '\n') if os.path.exists(PRED_FILE): # Load saved predictions, no need to use GPU data = loadmat(PRED_FILE) dcase_thres = data['dcase_thres'].ravel() dcase_test_y = data['dcase_test_y'] dcase_test_frame_y = data['dcase_test_frame_y'] dcase_test_outputs = [] dcase_test_outputs.append(data['dcase_test_global_prob']) dcase_test_outputs.append(data['dcase_test_frame_prob']) if args.pooling == 'att': dcase_test_outputs.append(data['dcase_test_frame_att']) gas_eval_y = data['gas_eval_y'] gas_eval_global_prob = data['gas_eval_global_prob'] else: import 
torch import torch.nn as nn from torch.optim import * from torch.optim.lr_scheduler import * from torch.autograd import Variable from Net import Net from util_in import * # Load model args.kernel_size = tuple(int(x) for x in args.kernel_size.split('x')) model = Net(args).cuda() model.load_state_dict(torch.load(MODEL_FILE)['model']) model.eval() # Load DCASE data dcase_valid_x, dcase_valid_y, _ = bulk_load('DCASE_valid') dcase_test_x, dcase_test_y, dcase_test_hashes = bulk_load('DCASE_test') dcase_test_frame_y = load_dcase_test_frame_truth() DCASE_CLASS_IDS = [318, 324, 341, 321, 307, 310, 314, 397, 325, 326, 323, 319, 14, 342, 329, 331, 316] # Predict on DCASE data dcase_valid_global_prob = model.predict(dcase_valid_x, verbose = False)[:, DCASE_CLASS_IDS] dcase_thres = optimize_micro_avg_f1(dcase_valid_global_prob, dcase_valid_y) dcase_test_outputs = model.predict(dcase_test_x, verbose = True) dcase_test_outputs = tuple(x[..., DCASE_CLASS_IDS] for x in dcase_test_outputs) # Load GAS data gas_eval_x, gas_eval_y, gas_eval_hashes = bulk_load('GAS_eval') # Predict on GAS data gas_eval_global_prob = model.predict(gas_eval_x, verbose = False) # Save predictions data = {} data['dcase_thres'] = dcase_thres data['dcase_test_hashes'] = dcase_test_hashes data['dcase_test_y'] = dcase_test_y data['dcase_test_frame_y'] = dcase_test_frame_y data['dcase_test_global_prob'] = dcase_test_outputs[0] data['dcase_test_frame_prob'] = dcase_test_outputs[1] if args.pooling == 'att': data['dcase_test_frame_att'] = dcase_test_outputs[2] data['gas_eval_hashes'] = gas_eval_hashes data['gas_eval_y'] = gas_eval_y data['gas_eval_global_prob'] = gas_eval_global_prob savemat(PRED_FILE, data) # Evaluation on DCASE 2017 write_log('Performance on DCASE 2017:') write_log('') write_log(' || || Task A (recording level) || Task B (1-second segment level) ') write_log(' CLASS || THRES || TP | FN | FP | Prec. | Recall | F1 || TP | FN | FP | Prec. 
| Recall | F1 | Sub | Del | Ins | ER ') FORMAT1 = ' Micro Avg || || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f | %#4d | %#4d | %#4d | %6.02f ' FORMAT2 = ' %######9d || %8.0006f || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f | | | | ' SEP = ''.join('+' if c == '|' else '-' for c in FORMAT1) write_log(SEP) # dcase_test_y and dcase_test_frame_y are inconsistent in some places # so when you evaluate Task A, use a "fake_dcase_test_frame_y" derived from dcase_test_y fake_dcase_test_frame_y = numpy.tile(numpy.expand_dims(dcase_test_y, 1), (1, 100, 1)) # Micro-average performance across all classes res_taskA = dcase_sed_eval(dcase_test_outputs, args.pooling, dcase_thres, fake_dcase_test_frame_y, 100, verbose = True) res_taskB = dcase_sed_eval(dcase_test_outputs, args.pooling, dcase_thres, dcase_test_frame_y, 10, verbose = True) write_log(FORMAT1 % (res_taskA.TP, res_taskA.FN, res_taskA.FP, res_taskA.precision, res_taskA.recall, res_taskA.F1, res_taskB.TP, res_taskB.FN, res_taskB.FP, res_taskB.precision, res_taskB.recall, res_taskB.F1, res_taskB.sub, res_taskB.dele, res_taskB.ins, res_taskB.ER)) write_log(SEP) # Class-wise performance N_CLASSES = dcase_test_outputs[0].shape[-1] for i in range(N_CLASSES): outputs = [x[..., i:i+1] for x in dcase_test_outputs] res_taskA = dcase_sed_eval(outputs, args.pooling, dcase_thres[i], fake_dcase_test_frame_y[..., i:i+1], 100, verbose = True) res_taskB = dcase_sed_eval(outputs, args.pooling, dcase_thres[i], dcase_test_frame_y[..., i:i+1], 10, verbose = True) write_log(FORMAT2 % (i, dcase_thres[i], res_taskA.TP, res_taskA.FN, res_taskA.FP, res_taskA.precision, res_taskA.recall, res_taskA.F1, res_taskB.TP, res_taskB.FN, res_taskB.FP, res_taskB.precision, res_taskB.recall, res_taskB.F1)) # Evaluation on Google Audio Set write_log('') write_log('Performance on Google Audio Set:') write_log('') write_log(" CLASS || AP | AUC | d' ") FORMAT = 
' %00007s || %5.3f | %5.3f |%6.03f ' SEP = ''.join('+' if c == '|' else '-' for c in FORMAT) write_log(SEP) classwise = [] N_CLASSES = gas_eval_global_prob.shape[-1] for i in range(N_CLASSES): classwise.append(gas_eval(gas_eval_global_prob[:,i], gas_eval_y[:,i])) # AP, AUC, dprime map, mauc = numpy.array(classwise).mean(axis = 0)[:2] write_log(FORMAT % ('Average', map, mauc, dprime(mauc))) write_log(SEP) for i in range(N_CLASSES): write_log(FORMAT % ((str(i),) + classwise[i])) ================================================ FILE: code/audioset/train.py ================================================ import sys, os, os.path, time import argparse import numpy import torch import torch.nn as nn from torch.optim import * from torch.optim.lr_scheduler import * from torch.autograd import Variable from Net import Net from util_in import * from util_out import * from util_f1 import * torch.backends.cudnn.benchmark = True # Parse input arguments def mybool(s): return s.lower() in ['t', 'true', 'y', 'yes', '1'] parser = argparse.ArgumentParser() parser.add_argument('--embedding_size', type = int, default = 1024) # this is the embedding size after a pooling layer # after a non-pooling layer, the embedding size will be twice this much parser.add_argument('--n_conv_layers', type = int, default = 10) parser.add_argument('--kernel_size', type = str, default = '3') # 'n' or 'nxm' parser.add_argument('--n_pool_layers', type = int, default = 5) # the pooling layers will be inserted uniformly into the conv layers # there should be at least 2 and at most 6 pooling layers # the first two pooling layers will have stride (2,2); later ones will have stride (1,2) parser.add_argument('--batch_norm', type = mybool, default = True) parser.add_argument('--dropout', type = float, default = 0.0) parser.add_argument('--pooling', type = str, default = 'lin', choices = ['max', 'ave', 'lin', 'exp', 'att']) parser.add_argument('--batch_size', type = int, default = 250)
parser.add_argument('--ckpt_size', type = int, default = 1000) # how many batches per checkpoint parser.add_argument('--optimizer', type = str, default = 'adam', choices = ['adam', 'sgd']) parser.add_argument('--init_lr', type = float, default = 1e-3) parser.add_argument('--lr_patience', type = int, default = 3) parser.add_argument('--lr_factor', type = float, default = 0.8) parser.add_argument('--max_ckpt', type = int, default = 30) parser.add_argument('--random_seed', type = int, default = 15213) args = parser.parse_args() if 'x' not in args.kernel_size: args.kernel_size = args.kernel_size + 'x' + args.kernel_size numpy.random.seed(args.random_seed) # Prepare log file and model directory expid = 'embed%d-%dC%dP-kernel%s-%s-drop%.1f-%s-batch%d-ckpt%d-%s-lr%.0e-pat%d-fac%.1f-seed%d' % ( args.embedding_size, args.n_conv_layers, args.n_pool_layers, args.kernel_size, 'bn' if args.batch_norm else 'nobn', args.dropout, args.pooling, args.batch_size, args.ckpt_size, args.optimizer, args.init_lr, args.lr_patience, args.lr_factor, args.random_seed ) WORKSPACE = os.path.join('../../workspace/audioset', expid) MODEL_PATH = os.path.join(WORKSPACE, 'model') if not os.path.exists(MODEL_PATH): os.makedirs(MODEL_PATH) LOG_FILE = os.path.join(WORKSPACE, 'train.log') with open(LOG_FILE, 'w'): pass def write_log(s): timestamp = time.strftime('%m-%d %H:%M:%S') msg = '[' + timestamp + '] ' + s print msg with open(LOG_FILE, 'a') as f: f.write(msg + '\n') # Load data write_log('Loading data ...') train_gen = batch_generator(batch_size = args.batch_size, random_seed = args.random_seed) gas_valid_x, gas_valid_y, _ = bulk_load('GAS_valid') gas_eval_x, gas_eval_y, _ = bulk_load('GAS_eval') dcase_valid_x, dcase_valid_y, _ = bulk_load('DCASE_valid') dcase_test_x, dcase_test_y, _ = bulk_load('DCASE_test') dcase_test_frame_truth = load_dcase_test_frame_truth() DCASE_CLASS_IDS = [318, 324, 341, 321, 307, 310, 314, 397, 325, 326, 323, 319, 14, 342, 329, 331, 316] # Build model args.kernel_size = 
tuple(int(x) for x in args.kernel_size.split('x')) model = Net(args).cuda() if args.optimizer == 'sgd': optimizer = SGD(model.parameters(), lr = args.init_lr, momentum = 0.9, nesterov = True) elif args.optimizer == 'adam': optimizer = Adam(model.parameters(), lr = args.init_lr) scheduler = ReduceLROnPlateau(optimizer, mode = 'max', factor = args.lr_factor, patience = args.lr_patience) if args.lr_factor < 1.0 else None criterion = nn.BCELoss() # Train model write_log('Training model ...') write_log(' || GAS_VALID || GAS_EVAL || D_VAL || DCASE_TEST ') write_log(" CKPT | LR | Tr.LOSS || MAP | MAUC | d' || MAP | MAUC | d' || Gl.F1 || Gl.F1 | Fr.ER | Fr.F1 | 1s.ER | 1s.F1 ") FORMAT = ' %#4d | %8.0003g | %8.0006f || %5.3f | %5.3f |%6.03f || %5.3f | %5.3f |%6.03f || %5.3f || %5.3f | %5.3f | %5.3f | %5.3f | %5.3f ' SEP = ''.join('+' if c == '|' else '-' for c in FORMAT) write_log(SEP) for checkpoint in range(1, args.max_ckpt + 1): # Train for args.ckpt_size batches model.train() train_loss = 0 for batch in range(1, args.ckpt_size + 1): x, y = next(train_gen) optimizer.zero_grad() global_prob = model(x)[0] global_prob.clamp_(min = 1e-7, max = 1 - 1e-7) loss = criterion(global_prob, y) train_loss += loss.data[0] if numpy.isnan(train_loss) or numpy.isinf(train_loss): break loss.backward() optimizer.step() sys.stderr.write('Checkpoint %d, Batch %d / %d, avg train loss = %f\r' % \ (checkpoint, batch, args.ckpt_size, train_loss / batch)) del x, y, global_prob, loss # This line and next line: to save GPU memory torch.cuda.empty_cache() # I don't know if they're useful or not train_loss /= args.ckpt_size # Evaluate model model.eval() sys.stderr.write('Evaluating model on GAS_VALID ...\r') global_prob = model.predict(gas_valid_x, verbose = False) gv_map, gv_mauc, gv_dprime = gas_eval(global_prob, gas_valid_y) sys.stderr.write('Evaluating model on GAS_EVAL ... 
\r') global_prob = model.predict(gas_eval_x, verbose = False) ge_map, ge_mauc, ge_dprime = gas_eval(global_prob, gas_eval_y) sys.stderr.write('Evaluating model on DCASE_VALID ...\r') global_prob = model.predict(dcase_valid_x, verbose = False)[:, DCASE_CLASS_IDS] thres = optimize_micro_avg_f1(global_prob, dcase_valid_y) dv_f1 = f1(global_prob >= thres, dcase_valid_y) sys.stderr.write('Evaluating model on DCASE_TEST ... \r') outputs = model.predict(dcase_test_x, verbose = True) outputs = tuple(x[..., DCASE_CLASS_IDS] for x in outputs) dt_f1 = f1(outputs[0] >= thres, dcase_test_y) dt_frame_er, dt_frame_f1 = dcase_sed_eval(outputs, args.pooling, thres, dcase_test_frame_truth, 1) dt_1s_er, dt_1s_f1 = dcase_sed_eval(outputs, args.pooling, thres, dcase_test_frame_truth, 10) # Write log write_log(FORMAT % ( checkpoint, optimizer.param_groups[0]['lr'], train_loss, gv_map, gv_mauc, gv_dprime, ge_map, ge_mauc, ge_dprime, dv_f1, dt_f1, dt_frame_er, dt_frame_f1, dt_1s_er, dt_1s_f1 )) # Abort if training has gone mad if numpy.isnan(train_loss) or numpy.isinf(train_loss): write_log('Aborted.') break # Save model. 
Too bad I can't save the scheduler MODEL_FILE = os.path.join(MODEL_PATH, 'checkpoint%d.pt' % checkpoint) state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()} sys.stderr.write('Saving model to %s ...\r' % MODEL_FILE) torch.save(state, MODEL_FILE) # Update learning rate if scheduler is not None: scheduler.step(gv_map) write_log('DONE!') ================================================ FILE: code/audioset/util_f1.py ================================================ import numpy # Compute F1 given predictions and truth def f1(pred, truth): return 2.0 * (pred & truth).sum() / (pred.sum() + truth.sum()) # Given scores and truth for a single class (as 1-D numpy arrays), find optimal threshold and corresponding F1 # Statistics of other classes may be given to optimize micro-average F1 def optimize_f1(scores, truth, extraNcorr = 0, extraNtrue = 0, extraNpred = 0): # Start with predicting everything as negative best_thres = numpy.inf best_f1 = 0.0 num = extraNcorr # number of correctly predicted instances den = extraNtrue + extraNpred + truth.sum() # number of predicted instances + true instances instances = [(-numpy.inf, False)] + sorted(zip(scores, truth)) # Lower the threshold gradually for i in range(len(instances) - 1, 0, -1): if instances[i][1]: num += 1 den += 1 if instances[i][0] > instances[i-1][0]: # Can put threshold here f1 = 2.0 * num / den if f1 > best_f1: best_thres = (instances[i][0] + instances[i-1][0]) / 2 best_f1 = f1 return best_thres, best_f1 # Given scores and truth for many classes (as 2-D numpy arrays), # find the optimal class-specific thresholds (as a 1-D numpy array) that maximizes the micro-average F1 # The algorithm is stochastic, but I have always observed deterministic results def optimize_micro_avg_f1(scores, truth): # First optimize each class individually nClasses = truth.shape[1] thres = numpy.zeros(nClasses, dtype = 'float64') for i in range(nClasses): thres[i], _ = optimize_f1(scores[:,i], truth[:,i]) Ntrue = 
truth.sum(axis = 0) Npred = (scores >= thres).sum(axis = 0) Ncorr = ((scores >= thres) & truth).sum(axis = 0) # Repeatedly re-tune the threshold for each class until convergence candidates = range(nClasses) while len(candidates) > 0: i = numpy.random.choice(candidates) candidates.remove(i) old_thres = thres[i] thres[i], _ = optimize_f1( scores[:,i], truth[:,i], extraNcorr = Ncorr.sum() - Ncorr[i], extraNtrue = Ntrue.sum() - Ntrue[i], extraNpred = Npred.sum() - Npred[i], ) if thres[i] != old_thres: Npred[i] = (scores[:,i] >= thres[i]).sum(axis = 0) Ncorr[i] = ((scores[:,i] >= thres[i]) & truth[:,i]).sum(axis = 0) candidates = range(nClasses) candidates.remove(i) return thres ================================================ FILE: code/audioset/util_in.py ================================================ import sys, os, os.path, glob import cPickle from scipy.io import loadmat import numpy from multiprocessing import Process, Queue import torch from torch.autograd import Variable N_CLASSES = 527 N_WORKERS = 6 GAS_FEATURE_DIR = '../../data/audioset' DCASE_FEATURE_DIR = '../../data/dcase' with open(os.path.join(GAS_FEATURE_DIR, 'normalizer.pkl'), 'rb') as f: mu, sigma = cPickle.load(f) def sample_generator(file_list, random_seed = 15213): rng = numpy.random.RandomState(random_seed) while True: rng.shuffle(file_list) for filename in file_list: data = loadmat(filename) feat = ((data['feat'] - mu) / sigma).astype('float32') labels = data['labels'].astype('float32') for i in range(len(data['feat'])): yield feat[i], labels[i] def worker(queues, file_lists, random_seed): generators = [sample_generator(file_lists[i], random_seed + i) for i in range(len(file_lists))] while True: for gen, q in zip(generators, queues): q.put(next(gen)) def batch_generator(batch_size, random_seed = 15213): queues = [Queue(5) for class_id in range(N_CLASSES)] file_lists = [sorted(glob.glob(os.path.join(GAS_FEATURE_DIR, 'GAS_train_unbalanced_class%03d_part*.mat' % class_id))) for class_id in
range(N_CLASSES)] for worker_id in range(N_WORKERS): p = Process(target = worker, args = (queues[worker_id::N_WORKERS], file_lists[worker_id::N_WORKERS], random_seed)) p.daemon = True p.start() rng = numpy.random.RandomState(random_seed) batch = [] while True: rng.shuffle(queues) for q in queues: batch.append(q.get()) if len(batch) == batch_size: yield tuple(Variable(torch.from_numpy(numpy.stack(x))).cuda() for x in zip(*batch)) batch = [] def bulk_load(prefix): feat = []; labels = []; hashes = [] for filename in sorted(glob.glob(os.path.join(GAS_FEATURE_DIR, '%s_*.mat' % prefix)) + glob.glob(os.path.join(DCASE_FEATURE_DIR, '%s_*.mat' % prefix))): data = loadmat(filename) feat.append(((data['feat'] - mu) / sigma).astype('float32')) labels.append(data['labels'].astype('bool')) hashes.append(data['hashes']) return numpy.concatenate(feat), numpy.concatenate(labels), numpy.concatenate(hashes) def load_dcase_test_frame_truth(): return cPickle.load(open(os.path.join(DCASE_FEATURE_DIR, 'DCASE_test_frame_label.pkl'), 'rb')) ================================================ FILE: code/audioset/util_out.py ================================================ from scipy import stats import numpy def roc(pred, truth): data = numpy.array(sorted(zip(pred, truth), reverse = True)) pred, truth = data[:,0], data[:,1].astype("bool") TP = truth.cumsum() FP = (1 - truth).cumsum() mask = numpy.concatenate([numpy.diff(pred) < 0, numpy.array([True])]) TP = numpy.concatenate([numpy.array([0]), TP[mask]]) FP = numpy.concatenate([numpy.array([0]), FP[mask]]) return TP, FP def ap_and_auc(pred, truth): TP, FP = roc(pred, truth) auc = ((TP[1:] + TP[:-1]) * numpy.diff(FP)).sum() / (2 * TP[-1] * FP[-1]) precision = TP[1:] / (TP + FP)[1:] weight = numpy.diff(TP) ap = (precision * weight).sum() / TP[-1] return ap, auc def dprime(auc): return stats.norm().ppf(auc) * numpy.sqrt(2.0) def gas_eval(pred, truth): if truth.ndim == 1: ap, auc = ap_and_auc(pred, truth) else: ap, auc = 
numpy.array([ap_and_auc(pred[:,i], truth[:,i]) for i in range(truth.shape[1]) if truth[:,i].any()]).mean(axis = 0) return ap, auc, dprime(auc) def dcase_sed_eval(outputs, pooling, thres, truth, seg_len, verbose = False): pred = outputs[1].reshape((-1, seg_len, outputs[1].shape[-1])) if pooling == 'max': seg_prob = pred.max(axis = 1) elif pooling == 'ave': seg_prob = pred.mean(axis = 1) elif pooling == 'lin': seg_prob = (pred * pred).sum(axis = 1) / pred.sum(axis = 1) elif pooling == 'exp': seg_prob = (pred * numpy.exp(pred)).sum(axis = 1) / numpy.exp(pred).sum(axis = 1) elif pooling == 'att': att = outputs[2].reshape((-1, seg_len, outputs[2].shape[-1])) seg_prob = (pred * att).sum(axis = 1) / att.sum(axis = 1) pred = seg_prob >= thres truth = truth.reshape((-1, seg_len, truth.shape[-1])).max(axis = 1) if not verbose: Ntrue = truth.sum(axis = 1) Npred = pred.sum(axis = 1) Ncorr = (truth & pred).sum(axis = 1) Nmiss = Ntrue - Ncorr Nfa = Npred - Ncorr error_rate = 1.0 * numpy.maximum(Nmiss, Nfa).sum() / Ntrue.sum() f1 = 2.0 * Ncorr.sum() / (Ntrue + Npred).sum() return error_rate, f1 else: class Object(object): pass res = Object() res.TP = (truth & pred).sum() res.FN = (truth & ~pred).sum() res.FP = (~truth & pred).sum() res.precision = 100.0 * res.TP / (res.TP + res.FP) res.recall = 100.0 * res.TP / (res.TP + res.FN) res.F1 = 200.0 * res.TP / (2 * res.TP + res.FP + res.FN) res.sub = numpy.minimum((truth & ~pred).sum(axis = 1), (~truth & pred).sum(axis = 1)).sum() res.dele = res.FN - res.sub res.ins = res.FP - res.sub res.ER = 100.0 * (res.sub + res.dele + res.ins) / (res.TP + res.FN) return res ================================================ FILE: code/dcase/Net.py ================================================ import torch import torch.nn as nn import torch.nn.functional as F from torch.autograd import Variable import numpy class Net(nn.Module): def __init__(self, args): super(Net, self).__init__() self.pooling = args.pooling self.dropout = args.dropout self.conv1 
= nn.Conv2d(1, 32, (5, 5), padding = (2, 2)) # (1, 400, 64) -> (32, 400, 64) self.conv2 = nn.Conv2d(32, 64, (5, 5), padding = (2, 2)) # (32, 400, 32) -> (64, 400, 32) self.conv3 = nn.Conv2d(64, 128, (5, 5), padding = (2, 2)) # (64, 200, 16) -> (128, 200, 16) self.gru = nn.GRU(1024, 100, 1, batch_first = True, bidirectional = True) self.fc_prob = nn.Linear(200, 17) if self.pooling == 'att': self.fc_att = nn.Linear(200, 17) # Better initialization nn.init.xavier_uniform(self.conv1.weight); nn.init.constant(self.conv1.bias, 0) nn.init.xavier_uniform(self.conv2.weight); nn.init.constant(self.conv2.bias, 0) nn.init.xavier_uniform(self.conv3.weight); nn.init.constant(self.conv3.bias, 0) nn.init.orthogonal(self.gru.weight_ih_l0); nn.init.constant(self.gru.bias_ih_l0, 0) nn.init.orthogonal(self.gru.weight_hh_l0); nn.init.constant(self.gru.bias_hh_l0, 0) nn.init.orthogonal(self.gru.weight_ih_l0_reverse); nn.init.constant(self.gru.bias_ih_l0_reverse, 0) nn.init.orthogonal(self.gru.weight_hh_l0_reverse); nn.init.constant(self.gru.bias_hh_l0_reverse, 0) nn.init.xavier_uniform(self.fc_prob.weight); nn.init.constant(self.fc_prob.bias, 0) if self.pooling == 'att': nn.init.xavier_uniform(self.fc_att.weight); nn.init.constant(self.fc_att.bias, 0) def forward(self, x): # shape of x: (batch, time, frequency) = (batch, 400, 64) x = x.view((-1, 1, x.size(1), x.size(2))) # x becomes (batch, channel, time, frequency) = (batch, 1, 400, 64) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) x = F.max_pool2d(F.relu(self.conv1(x)), (1, 2)) # (batch, 32, 400, 32) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) x = F.max_pool2d(F.relu(self.conv2(x)), (2, 2)) # (batch, 64, 200, 16) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) x = F.max_pool2d(F.relu(self.conv3(x)), (2, 2)) # (batch, 128, 100, 8) x = x.permute(0, 2, 1, 3).contiguous() # x becomes (batch, time, channel, frequency) = (batch, 100, 
128, 8) x = x.view((-1, x.size(1), x.size(2) * x.size(3))) # x becomes (batch, time, channel * frequency) = (batch, 100, 1024) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) x, _ = self.gru(x) # (batch, 100, 200) if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training) frame_prob = F.sigmoid(self.fc_prob(x)) # shape of frame_prob: (batch, time, class) = (batch, 100, 17) if self.pooling == 'max': global_prob, _ = frame_prob.max(dim = 1) return global_prob, frame_prob elif self.pooling == 'ave': global_prob = frame_prob.mean(dim = 1) return global_prob, frame_prob elif self.pooling == 'lin': global_prob = (frame_prob * frame_prob).sum(dim = 1) / frame_prob.sum(dim = 1) return global_prob, frame_prob elif self.pooling == 'exp': global_prob = (frame_prob * frame_prob.exp()).sum(dim = 1) / frame_prob.exp().sum(dim = 1) return global_prob, frame_prob elif self.pooling == 'att': frame_att = F.softmax(self.fc_att(x), dim = 1) global_prob = (frame_prob * frame_att).sum(dim = 1) return global_prob, frame_prob, frame_att def predict(self, x, verbose = True, batch_size = 100): # Predict in batches. Both input and output are numpy arrays. 
        # If verbose == True, return all of global_prob, frame_prob and att
        # If verbose == False, only return global_prob
        result = []
        for i in range(0, len(x), batch_size):
            with torch.no_grad():
                input = Variable(torch.from_numpy(x[i : i + batch_size])).cuda()
                output = self.forward(input)
                if not verbose: output = output[:1]
                result.append([var.data.cpu().numpy() for var in output])
        result = tuple(numpy.concatenate(items) for items in zip(*result))
        return result if verbose else result[0]

================================================
FILE: code/dcase/eval.py
================================================
import sys, os, os.path
import argparse
import numpy
from util_out import *
from util_f1 import *
from scipy.io import loadmat, savemat

# Parse input arguments
parser = argparse.ArgumentParser(description = '')
parser.add_argument('--pooling', type = str, default = 'lin', choices = ['max', 'ave', 'lin', 'exp', 'att'])
parser.add_argument('--dropout', type = float, default = 0.0)
parser.add_argument('--batch_size', type = int, default = 100)
parser.add_argument('--ckpt_size', type = int, default = 500)
parser.add_argument('--optimizer', type = str, default = 'adam', choices = ['adam', 'sgd'])
parser.add_argument('--init_lr', type = float, default = 3e-4)
parser.add_argument('--lr_patience', type = int, default = 3)
parser.add_argument('--lr_factor', type = float, default = 0.5)
parser.add_argument('--random_seed', type = int, default = 15213)
parser.add_argument('--ckpt', type = int)
args = parser.parse_args()

# Locate model file and prepare directories for prediction and evaluation
expid = '%s-drop%.1f-batch%d-ckpt%d-%s-lr%.0e-pat%d-fac%.1f-seed%d' % (
    args.pooling, args.dropout, args.batch_size, args.ckpt_size,
    args.optimizer, args.init_lr, args.lr_patience, args.lr_factor, args.random_seed
)
WORKSPACE = os.path.join('../../workspace/dcase', expid)
MODEL_FILE = os.path.join(WORKSPACE, 'model', 'checkpoint%d.pt' % args.ckpt)
PRED_PATH = os.path.join(WORKSPACE, 'pred')
if not os.path.exists(PRED_PATH): os.makedirs(PRED_PATH)
PRED_FILE = os.path.join(PRED_PATH, 'checkpoint%d.mat' % args.ckpt)
EVAL_PATH = os.path.join(WORKSPACE, 'eval')
if not os.path.exists(EVAL_PATH): os.makedirs(EVAL_PATH)
EVAL_FILE = os.path.join(EVAL_PATH, 'checkpoint%d.txt' % args.ckpt)
with open(EVAL_FILE, 'w'): pass

def write_log(s):
    print s
    with open(EVAL_FILE, 'a') as f:
        f.write(s + '\n')

if os.path.exists(PRED_FILE):
    # Load saved predictions, no need to use GPU
    data = loadmat(PRED_FILE)
    thres = data['thres'].ravel()
    test_y = data['test_y']
    test_frame_y = data['test_frame_y']
    test_outputs = []
    test_outputs.append(data['test_global_prob'])
    test_outputs.append(data['test_frame_prob'])
    if args.pooling == 'att':
        test_outputs.append(data['test_frame_att'])
else:
    import torch
    import torch.nn as nn
    from torch.optim import *
    from torch.optim.lr_scheduler import *
    from torch.autograd import Variable
    from Net import Net
    from util_in import *

    # Load model
    model = Net(args).cuda()
    model.load_state_dict(torch.load(MODEL_FILE)['model'])
    model.eval()

    # Load data
    valid_x, valid_y, _ = bulk_load('DCASE_valid')
    test_x, test_y, test_hashes = bulk_load('DCASE_test')
    test_frame_y = load_dcase_test_frame_truth()

    # Predict
    valid_global_prob = model.predict(valid_x, verbose = False)
    thres = optimize_micro_avg_f1(valid_global_prob, valid_y)
    test_outputs = model.predict(test_x, verbose = True)

    # Save predictions
    data = {}
    data['thres'] = thres
    data['test_hashes'] = test_hashes
    data['test_y'] = test_y
    data['test_frame_y'] = test_frame_y
    data['test_global_prob'] = test_outputs[0]
    data['test_frame_prob'] = test_outputs[1]
    if args.pooling == 'att':
        data['test_frame_att'] = test_outputs[2]
    savemat(PRED_FILE, data)

# Evaluation
write_log(' || || Task A (recording level) || Task B (1-second segment level) ')
write_log(' CLASS || THRES || TP | FN | FP | Prec. | Recall | F1 || TP | FN | FP | Prec. | Recall | F1 | Sub | Del | Ins | ER ')
FORMAT1 = ' Micro Avg || || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f | %#4d | %#4d | %#4d | %6.02f '
FORMAT2 = ' %######9d || %8.0006f || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f || %#4d | %#4d | %#4d | %6.02f | %6.02f | %6.02f | | | | '
SEP = ''.join('+' if c == '|' else '-' for c in FORMAT1)
write_log(SEP)

# test_y and test_frame_y are inconsistent in some places,
# so when you evaluate Task A, use a "fake_test_frame_y" derived from test_y
fake_test_frame_y = numpy.tile(numpy.expand_dims(test_y, 1), (1, 100, 1))

# Micro-average performance across all classes
res_taskA = dcase_sed_eval(test_outputs, args.pooling, thres, fake_test_frame_y, 100, verbose = True)
res_taskB = dcase_sed_eval(test_outputs, args.pooling, thres, test_frame_y, 10, verbose = True)
write_log(FORMAT1 % (res_taskA.TP, res_taskA.FN, res_taskA.FP, res_taskA.precision, res_taskA.recall, res_taskA.F1,
                     res_taskB.TP, res_taskB.FN, res_taskB.FP, res_taskB.precision, res_taskB.recall, res_taskB.F1,
                     res_taskB.sub, res_taskB.dele, res_taskB.ins, res_taskB.ER))
write_log(SEP)

# Class-wise performance
N_CLASSES = test_outputs[0].shape[-1]
for i in range(N_CLASSES):
    outputs = [x[..., i:i+1] for x in test_outputs]
    res_taskA = dcase_sed_eval(outputs, args.pooling, thres[i], fake_test_frame_y[..., i:i+1], 100, verbose = True)
    res_taskB = dcase_sed_eval(outputs, args.pooling, thres[i], test_frame_y[..., i:i+1], 10, verbose = True)
    write_log(FORMAT2 % (i, thres[i], res_taskA.TP, res_taskA.FN, res_taskA.FP, res_taskA.precision, res_taskA.recall, res_taskA.F1,
                         res_taskB.TP, res_taskB.FN, res_taskB.FP, res_taskB.precision, res_taskB.recall, res_taskB.F1))

================================================
FILE: code/dcase/train.py
================================================
import sys, os, os.path, time
import argparse
import numpy
import torch
import torch.nn as nn
from torch.optim import *
from torch.optim.lr_scheduler import *
from torch.autograd import Variable
from Net import Net
from util_in import *
from util_out import *
from util_f1 import *

torch.backends.cudnn.benchmark = True

# Parse input arguments
parser = argparse.ArgumentParser(description = '')
parser.add_argument('--pooling', type = str, default = 'lin', choices = ['max', 'ave', 'lin', 'exp', 'att'])
parser.add_argument('--dropout', type = float, default = 0.0)
parser.add_argument('--batch_size', type = int, default = 100)
parser.add_argument('--ckpt_size', type = int, default = 500)
parser.add_argument('--optimizer', type = str, default = 'adam', choices = ['adam', 'sgd'])
parser.add_argument('--init_lr', type = float, default = 3e-4)
parser.add_argument('--lr_patience', type = int, default = 3)
parser.add_argument('--lr_factor', type = float, default = 0.5)
parser.add_argument('--max_ckpt', type = int, default = 50)
parser.add_argument('--random_seed', type = int, default = 15213)
args = parser.parse_args()
numpy.random.seed(args.random_seed)

# Prepare log file and model directory
expid = '%s-drop%.1f-batch%d-ckpt%d-%s-lr%.0e-pat%d-fac%.1f-seed%d' % (
    args.pooling, args.dropout, args.batch_size, args.ckpt_size,
    args.optimizer, args.init_lr, args.lr_patience, args.lr_factor, args.random_seed
)
WORKSPACE = os.path.join('../../workspace/dcase', expid)
MODEL_PATH = os.path.join(WORKSPACE, 'model')
if not os.path.exists(MODEL_PATH): os.makedirs(MODEL_PATH)
LOG_FILE = os.path.join(WORKSPACE, 'train.log')
with open(LOG_FILE, 'w'): pass

def write_log(s):
    timestamp = time.strftime('%Y-%m-%d %H:%M:%S')
    msg = '[' + timestamp + '] ' + s
    print msg
    with open(LOG_FILE, 'a') as f:
        f.write(msg + '\n')

# Load data
write_log('Loading data ...')
valid_x, valid_y, _ = bulk_load('DCASE_valid')
test_x, test_y, _ = bulk_load('DCASE_test')
test_frame_y = load_dcase_test_frame_truth()

# Build model
write_log('Building model ...')
model = Net(args).cuda()
if args.optimizer == 'sgd':
    optimizer = SGD(model.parameters(), lr = args.init_lr, momentum = 0.9, nesterov = True)
elif args.optimizer == 'adam':
    optimizer = Adam(model.parameters(), lr = args.init_lr)
if args.lr_factor < 1.0:
    scheduler = ReduceLROnPlateau(optimizer, mode = 'min', factor = args.lr_factor, patience = args.lr_patience)
criterion = nn.BCELoss()

def bce_loss(input, target):
    return -numpy.log(numpy.where(target, input, 1 - input)).sum() / input.size

# Train model
write_log('Training model ...')
write_log(' || D_VAL || DCASE_TEST ')
write_log(' CKPT | LR | Tr.LOSS | Val.LOSS || Gl.F1 || Gl.F1 | Fr.ER | Fr.F1 | 1s.ER | 1s.F1 ')
FORMAT = ' %#4d | %8.0003g | %8.0006f | %8.0006f || %5.3f || %5.3f | %5.3f | %5.3f | %5.3f | %5.3f '
SEP = ''.join('+' if c == '|' else '-' for c in FORMAT)
write_log(SEP)
gen_train = batch_generator(args.batch_size, args.random_seed)
for ckpt in range(1, args.max_ckpt + 1):
    model.train()
    train_loss = 0
    for i in range(args.ckpt_size):
        x, y = next(gen_train)
        optimizer.zero_grad()
        global_prob = model(x)[0]
        global_prob.clamp_(min = 1e-7, max = 1 - 1e-7)
        loss = criterion(global_prob, y)
        train_loss += loss.data[0]
        loss.backward()
        optimizer.step()
        sys.stderr.write('Checkpoint %d, Batch %d / %d, avg train loss = %f\r' % (ckpt, i + 1, args.ckpt_size, train_loss / (i + 1)))
    train_loss /= args.ckpt_size

    # Compute validation loss, validation F1 and test F1
    model.eval()
    valid_global_prob = model.predict(valid_x, verbose = False)
    valid_loss = bce_loss(valid_global_prob, valid_y)
    thres = optimize_micro_avg_f1(valid_global_prob, valid_y)
    valid_global_f1 = f1(valid_global_prob >= thres, valid_y)
    test_outputs = model.predict(test_x, verbose = True)
    test_global_f1 = f1(test_outputs[0] >= thres, test_y)
    test_frame_er, test_frame_f1 = dcase_sed_eval(test_outputs, args.pooling, thres, test_frame_y, 1)    # every 1 frame is a segment
    test_1s_er, test_1s_f1 = dcase_sed_eval(test_outputs, args.pooling, thres, test_frame_y, 10)         # every 10 frames is a segment

    # Write log
    write_log(FORMAT % (
        ckpt, optimizer.param_groups[0]['lr'], train_loss, valid_loss,
        valid_global_f1, test_global_f1, test_frame_er, test_frame_f1, test_1s_er, test_1s_f1
    ))

    # Abort if training has gone mad
    if numpy.isnan(train_loss) or numpy.isinf(train_loss):
        write_log('Aborted.')
        break

    # Save model. Too bad I can't save the scheduler
    MODEL_FILE = os.path.join(MODEL_PATH, 'checkpoint%d.pt' % ckpt)
    state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    torch.save(state, MODEL_FILE)

    # Update learning rate
    if args.lr_factor < 1.0:
        scheduler.step(valid_loss)

write_log('DONE!')

================================================
FILE: code/dcase/util_f1.py
================================================
import numpy

# Compute F1 given predictions and truth
def f1(pred, truth):
    return 2.0 * (pred & truth).sum() / (pred.sum() + truth.sum())

# Given scores and truth for a single class (as 1-D numpy arrays), find the optimal threshold and corresponding F1
# Statistics of other classes may be given to optimize micro-average F1
def optimize_f1(scores, truth, extraNcorr = 0, extraNtrue = 0, extraNpred = 0):
    # Start with predicting everything as negative
    best_thres = numpy.inf
    best_f1 = 0.0
    num = extraNcorr                               # number of correctly predicted instances
    den = extraNtrue + extraNpred + truth.sum()    # number of predicted instances + true instances
    instances = [(-numpy.inf, False)] + sorted(zip(scores, truth))
    # Lower the threshold gradually
    for i in range(len(instances) - 1, 0, -1):
        if instances[i][1]:
            num += 1
        den += 1
        if instances[i][0] > instances[i-1][0]:    # Can put threshold here
            f1 = 2.0 * num / den
            if f1 > best_f1:
                best_thres = (instances[i][0] + instances[i-1][0]) / 2
                best_f1 = f1
    return best_thres, best_f1

# Given scores and truth for many classes (as 2-D numpy arrays),
# find the optimal class-specific thresholds (as a 1-D numpy array) that maximize the micro-average F1
# The algorithm is stochastic, but I have always observed deterministic results
def optimize_micro_avg_f1(scores, truth):
    # First optimize each class individually
    nClasses = truth.shape[1]
    thres = numpy.zeros(nClasses, dtype = 'float64')
    for i in range(nClasses):
        thres[i], _ = optimize_f1(scores[:,i], truth[:,i])
    Ntrue = truth.sum(axis = 0)
    Npred = (scores >= thres).sum(axis = 0)
    Ncorr = ((scores >= thres) & truth).sum(axis = 0)
    # Repeatedly re-tune the threshold for each class until convergence
    candidates = range(nClasses)
    while len(candidates) > 0:
        i = numpy.random.choice(candidates)
        candidates.remove(i)
        old_thres = thres[i]
        thres[i], _ = optimize_f1(
            scores[:,i],
            truth[:,i],
            extraNcorr = Ncorr.sum() - Ncorr[i],
            extraNtrue = Ntrue.sum() - Ntrue[i],
            extraNpred = Npred.sum() - Npred[i],
        )
        if thres[i] != old_thres:
            Npred[i] = (scores[:,i] >= thres[i]).sum(axis = 0)
            Ncorr[i] = ((scores[:,i] >= thres[i]) & truth[:,i]).sum(axis = 0)
            candidates = range(nClasses)
            candidates.remove(i)
    return thres

================================================
FILE: code/dcase/util_in.py
================================================
import sys, os, os.path, glob
import cPickle
from scipy.io import loadmat
import numpy
from multiprocessing import Process, Queue
import torch
from torch.autograd import Variable

N_CLASSES = 17
N_WORKERS = 6
FEATURE_DIR = '../../data/dcase'

with open(os.path.join(FEATURE_DIR, 'normalizer.pkl'), 'rb') as f:
    mu, sigma = cPickle.load(f)

def sample_generator(file_list, random_seed = 15213):
    rng = numpy.random.RandomState(random_seed)
    while True:
        rng.shuffle(file_list)
        for filename in file_list:
            data = loadmat(filename)
            feat = ((data['feat'] - mu) / sigma).astype('float32')
            labels = data['labels'].astype('float32')
            for i in range(len(data['feat'])):
                yield feat[i], labels[i]

def worker(queues, file_lists, random_seed):
    generators = [sample_generator(file_lists[i], random_seed + i) for i in range(len(file_lists))]
    while True:
        for gen, q in zip(generators, queues):
            q.put(next(gen))

def batch_generator(batch_size, random_seed = 15213):
    queues = [Queue(5) for class_id in range(N_CLASSES)]
    file_lists = [sorted(glob.glob(os.path.join(FEATURE_DIR, 'DCASE_train_class%02d_part*.mat' % class_id))) for class_id in range(N_CLASSES)]
    for worker_id in range(N_WORKERS):
        p = Process(target = worker, args = (queues[worker_id::N_WORKERS], file_lists[worker_id::N_WORKERS], random_seed))
        p.daemon = True
        p.start()
    rng = numpy.random.RandomState(random_seed)
    batch = []
    while True:
        rng.shuffle(queues)
        for q in queues:
            batch.append(q.get())
            if len(batch) == batch_size:
                yield tuple(Variable(torch.from_numpy(numpy.stack(x))).cuda() for x in zip(*batch))
                batch = []

def bulk_load(prefix):
    feat = []; labels = []; hashes = []
    for filename in sorted(glob.glob(os.path.join(FEATURE_DIR, '%s_*.mat' % prefix))):
        data = loadmat(filename)
        feat.append(((data['feat'] - mu) / sigma).astype('float32'))
        labels.append(data['labels'].astype('bool'))
        hashes.append(data['hashes'])
    return numpy.concatenate(feat), numpy.concatenate(labels), numpy.concatenate(hashes)

def load_dcase_test_frame_truth():
    return cPickle.load(open(os.path.join(FEATURE_DIR, 'DCASE_test_frame_label.pkl'), 'rb'))

================================================
FILE: code/dcase/util_out.py
================================================
import numpy

def dcase_sed_eval(outputs, pooling, thres, truth, seg_len, verbose = False):
    pred = outputs[1].reshape((-1, seg_len, outputs[1].shape[-1]))
    if pooling == 'max':
        seg_prob = pred.max(axis = 1)
    elif pooling == 'ave':
        seg_prob = pred.mean(axis = 1)
    elif pooling == 'lin':
        seg_prob = (pred * pred).sum(axis = 1) / pred.sum(axis = 1)
    elif pooling == 'exp':
        seg_prob = (pred * numpy.exp(pred)).sum(axis = 1) / numpy.exp(pred).sum(axis = 1)
    elif pooling == 'att':
        att = outputs[2].reshape((-1, seg_len, outputs[2].shape[-1]))
        seg_prob = (pred * att).sum(axis = 1) / att.sum(axis = 1)
    pred = seg_prob >= thres
    truth = truth.reshape((-1, seg_len, truth.shape[-1])).max(axis = 1)
    if not verbose:
        Ntrue = truth.sum(axis = 1)
        Npred = pred.sum(axis = 1)
        Ncorr = (truth & pred).sum(axis = 1)
        Nmiss = Ntrue - Ncorr
        Nfa = Npred - Ncorr
        error_rate = 1.0 * numpy.maximum(Nmiss, Nfa).sum() / Ntrue.sum()
        f1 = 2.0 * Ncorr.sum() / (Ntrue + Npred).sum()
        return error_rate, f1
    else:
        class Object(object):
            pass
        res = Object()
        res.TP = (truth & pred).sum()
        res.FN = (truth & ~pred).sum()
        res.FP = (~truth & pred).sum()
        res.precision = 100.0 * res.TP / (res.TP + res.FP)
        res.recall = 100.0 * res.TP / (res.TP + res.FN)
        res.F1 = 200.0 * res.TP / (2 * res.TP + res.FP + res.FN)
        res.sub = numpy.minimum((truth & ~pred).sum(axis = 1), (~truth & pred).sum(axis = 1)).sum()
        res.dele = res.FN - res.sub
        res.ins = res.FP - res.sub
        res.ER = 100.0 * (res.sub + res.dele + res.ins) / (res.TP + res.FN)
        return res

================================================
FILE: code/sequential/Net.py
================================================
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import numpy

class ConvBlock(nn.Module):
    def __init__(self, n_input_feature_maps, n_output_feature_maps, kernel_size_2d, batch_norm = False, pool_stride = None):
        super(ConvBlock, self).__init__()
        assert all(x % 2 == 1 for x in kernel_size_2d)
        self.n_input = n_input_feature_maps
        self.n_output = n_output_feature_maps
        self.kernel_size = kernel_size_2d
        self.batch_norm = batch_norm
        self.pool_stride = pool_stride
        # "~batch_norm" should be written as "not batch_norm"; otherwise ~True will evaluate to -2 and be treated as True.
        # But I'll keep this error to avoid breaking existing models.
        self.conv = nn.Conv2d(self.n_input, self.n_output, self.kernel_size, padding = tuple(x/2 for x in self.kernel_size), bias = ~batch_norm)
        if batch_norm: self.bn = nn.BatchNorm2d(self.n_output)
        nn.init.xavier_uniform(self.conv.weight)

    def forward(self, x):
        x = self.conv(x)
        if self.batch_norm: x = self.bn(x)
        x = F.relu(x)
        if self.pool_stride is not None: x = F.max_pool2d(x, self.pool_stride)
        return x

class Net(nn.Module):
    def __init__(self, args):
        super(Net, self).__init__()
        self.__dict__.update(args.__dict__)    # Instill all args into self
        assert self.n_conv_layers % self.n_pool_layers == 0
        self.input_n_freq_bins = n_freq_bins = 64
        self.output_size = 71 if self.mode == 'ctc' else 35
        self.conv = []
        pool_interval = self.n_conv_layers / self.n_pool_layers
        n_input = 1
        for i in range(self.n_conv_layers):
            if (i + 1) % pool_interval == 0:    # this layer has pooling
                n_freq_bins /= 2
                n_output = self.embedding_size / n_freq_bins
                pool_stride = (2, 2) if i < pool_interval * 2 else (1, 2)
            else:
                n_output = self.embedding_size * 2 / n_freq_bins
                pool_stride = None
            layer = ConvBlock(n_input, n_output, self.kernel_size, batch_norm = self.batch_norm, pool_stride = pool_stride)
            self.conv.append(layer)
            self.__setattr__('conv' + str(i + 1), layer)
            n_input = n_output
        self.gru = nn.GRU(self.embedding_size, self.embedding_size / 2, 1, batch_first = True, bidirectional = True)
        self.fc = nn.Linear(self.embedding_size, self.output_size)
        # Better initialization
        nn.init.orthogonal(self.gru.weight_ih_l0); nn.init.constant(self.gru.bias_ih_l0, 0)
        nn.init.orthogonal(self.gru.weight_hh_l0); nn.init.constant(self.gru.bias_hh_l0, 0)
        nn.init.orthogonal(self.gru.weight_ih_l0_reverse); nn.init.constant(self.gru.bias_ih_l0_reverse, 0)
        nn.init.orthogonal(self.gru.weight_hh_l0_reverse); nn.init.constant(self.gru.bias_hh_l0_reverse, 0)
        nn.init.xavier_uniform(self.fc.weight); nn.init.constant(self.fc.bias, 0)

    def forward(self, x):
        x = x.view((-1, 1, x.size(1), x.size(2)))    # x becomes (batch, channel, time, freq)
        for i in range(len(self.conv)):
            if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training)
            x = self.conv[i](x)                      # x becomes (batch, channel, time, freq)
        x = x.permute(0, 2, 1, 3).contiguous()       # x becomes (batch, time, channel, freq)
        x = x.view((-1, x.size(1), x.size(2) * x.size(3)))    # x becomes (batch, time, embedding_size)
        if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training)
        x, _ = self.gru(x)                           # x becomes (batch, time, embedding_size)
        if self.dropout > 0: x = F.dropout(x, p = self.dropout, training = self.training)
        if self.mode == 'ctc':
            log_prob = F.log_softmax(self.fc(x), dim = -1)    # shape of log_prob: (batch, time, output_size)
            return log_prob                          # returns the log probability
        else:
            frame_prob = F.sigmoid(self.fc(x))       # shape of frame_prob: (batch, time, output_size)
            frame_prob = torch.clamp(frame_prob, 1e-7, 1 - 1e-7)
            return frame_prob

    def predict(self, x, batch_size = 300):
        # Predict in batches. Both input and output are numpy arrays.
        result = []
        for i in range(0, len(x), batch_size):
            with torch.no_grad():
                input = Variable(torch.from_numpy(x[i : i + batch_size])).cuda()
                output = self.forward(input)
                result.append(output.data.cpu().numpy())
        return numpy.concatenate(result)

================================================
FILE: code/sequential/ctc.py
================================================
import numpy
numpy.seterr(divide = 'ignore')
import torch
from torch.autograd import Variable

def logsumexp(*args):
    M = reduce(torch.max, args)
    mask = M != -numpy.inf
    M[mask] += torch.log(sum(torch.exp(x[mask] - M[mask]) for x in args))
    # Must pick the valid part out, otherwise the gradient will contain NaNs
    return M

# Input arguments:
# logProb: a 3-D Variable of size N_SEQS * N_FRAMES * N_LABELS containing LOG probabilities.
# seqLen: a list or numpy array indicating the number of valid frames in each sequence.
# label: a list of label sequences.
# Note on implementation:
# Anything that will be backpropped must be a Variable;
# Anything used as an index must be a torch.cuda.LongTensor.
def ctc_loss(logProb, seqLen, label, debug = False):
    seqLen = numpy.array(seqLen)
    nSeqs, nFrames = logProb.size(0), logProb.size(1)
    # Find out the lengths of the label sequences
    labelLen = torch.from_numpy(numpy.array([len(x) for x in label])).cuda()
    # Insert blank symbol at the beginning, at the end, and between all symbols of the label sequences
    nStates = max(len(x) for x in label) * 2 + 1
    extendedLabel = numpy.zeros((nSeqs, nStates), dtype = 'int64')
    for i in range(nSeqs):
        extendedLabel[i, 1 : (len(label[i]) * 2) : 2] = label[i]
    label = torch.from_numpy(extendedLabel).cuda()
    # Compute alpha trellis
    dummyColumn = Variable(-numpy.inf * torch.ones((nSeqs, 1)).cuda())
    allSeqIndex = torch.from_numpy(numpy.tile(numpy.arange(nSeqs), (nStates, 1)).T).cuda()
    uttLogProb = Variable(torch.zeros(nSeqs).cuda())
    for frame in range(nFrames):
        if frame == 0:
            # Initialize the log probability of the first two states to log(1), and other states to log(0)
            alpha = Variable(-numpy.inf * torch.ones((nSeqs, nStates)).cuda())
            alpha[:, :2] = 0
        else:
            # Receive probability from previous frame
            p2 = alpha[:, :-2].clone()
            p2[label[:, 2:] == label[:, :-2]] = -numpy.inf    # Probability can pass across labels two steps apart if they are different
            alpha = logsumexp(alpha, torch.cat([dummyColumn, alpha[:, :-1]], 1), torch.cat([dummyColumn, dummyColumn, p2], 1))
        # Multiply with the probability of current frame
        alpha += logProb[allSeqIndex, frame, label]
        # Collect probability for ends of utterances
        seqIndex = (seqLen == frame + 1).nonzero()[0]
        if len(seqIndex) > 0:
            seqIndex = torch.from_numpy(seqIndex).cuda()
            ll = labelLen[seqIndex]
            p = alpha[seqIndex, ll * 2].clone()
            if (ll > 0).any():
                p[ll > 0] = logsumexp(p[ll > 0], alpha[seqIndex[ll > 0], ll[ll > 0] * 2 - 1])
            uttLogProb[seqIndex] = p
    # Return the per-frame negative log probability of all utterances (and per-utterance log probs if debug == True)
    loss = -uttLogProb.sum() / seqLen.sum()
    if debug:
        return loss, uttLogProb
    else:
        return loss

if __name__ == '__main__':
    torch.set_printoptions(precision = 5)
    label = numpy.array([[2, 1, 1, 3],    # BAAC
                         [0, 0, 0, 0],    # null
                         [1, 0, 0, 0],    # A
                         [3, 2, 0, 0],    # CB
                         [0, 0, 0, 0],    # null
                         [1, 0, 0, 0],    # A
                         [3, 2, 0, 0]])   # CB
    seqLen = numpy.array([5, 3, 3, 3, 1, 1, 1])
    logProb = numpy.log(numpy.tile(numpy.array([[[0.1, 0.2, 0.3, 0.4]]], dtype = 'float32'), (len(seqLen), max(seqLen), 1)))
    logProb = Variable(torch.from_numpy(logProb).cuda(), requires_grad = True)
    loss, uttLogProb = ctc_loss(logProb, seqLen, label, debug = True)
    print loss, torch.exp(uttLogProb)
    # Expected output of torch.exp(uttLogProb): [0.00048, 0.001, 0.022, 0.12, 0.1, 0.2, 0]
    loss.backward()
    # print logProb.grad

================================================
FILE: code/sequential/ctl.py
================================================
import numpy
numpy.seterr(divide = 'ignore')
import torch
from torch.autograd import Variable

def cuda(x):
    return x.cuda() if torch.cuda.is_available() else x

def tensor(array):
    if array.dtype == 'bool': array = array.astype('uint8')
    return cuda(torch.from_numpy(array))

def variable(array):
    if isinstance(array, numpy.ndarray): array = tensor(array)
    return cuda(Variable(array))

def logsumexp(*args):
    M = reduce(torch.max, args)
    mask = M != -numpy.inf
    M[mask] += torch.log(sum(torch.exp(x[mask] - M[mask]) for x in args))
    # Must pick the valid part out, otherwise the gradient will contain NaNs
    return M

# Input arguments:
# frameProb: a 3-D Variable of size N_SEQS * N_FRAMES * N_CLASSES containing the probability of each event at each frame.
# seqLen: a list or numpy array indicating the number of valid frames in each sequence.
# label: a list of label sequences.
# Note on implementation:
# Anything that will be backpropped must be a Variable;
# Anything used as an index must be a torch.cuda.LongTensor.
def ctl_loss(frameProb, seqLen, label, maxConcur = 1, debug = False):
    seqLen = numpy.array(seqLen)
    nSeqs, nFrames, nClasses = frameProb.size()

    # Clear the content in the frames of frameProb beyond seqLen
    frameIndex = numpy.tile(numpy.arange(nFrames), (nSeqs, 1))
    mask = variable(numpy.expand_dims(frameIndex < seqLen.reshape((nSeqs, 1)), 2))
    z = variable(torch.zeros(frameProb.size()))
    frameProb = torch.where(mask, frameProb, z)

    # Convert frameProb (probabilities of events) into probabilities of event boundaries
    z = variable(1e-7 * torch.ones((nSeqs, 1, nClasses)))    # Real zeros would cause NaNs in the gradient
    frameProb = torch.cat([z, frameProb, z], dim = 1)
    startProb = torch.clamp(frameProb[:, 1:] - frameProb[:, :-1], min = 1e-7)
    endProb = torch.clamp(frameProb[:, :-1] - frameProb[:, 1:], min = 1e-7)
    boundaryProb = torch.stack([startProb, endProb], dim = 3).view((nSeqs, nFrames + 1, nClasses * 2))
    blankLogProb = torch.log(1 - boundaryProb).sum(dim = 2)
    # blankLogProb[seq, frame] = log probability of emitting nothing at this frame
    deltaLogProb = torch.log(boundaryProb) - torch.log(1 - boundaryProb)
    # deltaLogProb[seq, frame, token] = log prob of emitting token minus log prob of not emitting token

    # Find out the lengths of the label sequences
    labelLen = tensor(numpy.array([len(x) for x in label]))

    # Put the label sequences into a Variable
    maxLabelLen = max(len(x) for x in label)
    L = numpy.zeros((nSeqs, maxLabelLen), dtype = 'int64')
    for i in range(nSeqs):
        L[i, :len(label[i])] = numpy.array(label[i]) - 1    # minus one because we no longer have a dedicated blank token
    label = tensor(L)
    if maxConcur > maxLabelLen:
        maxConcur = maxLabelLen

    # Compute alpha trellis
    # alpha[m, n] = log probability of having emitted n tokens in the m-th sequence at the current frame
    nStates = maxLabelLen + 1
    alpha = variable(-numpy.inf * torch.ones((nSeqs, nStates)))
    alpha[:, 0] = 0
    seqIndex = tensor(numpy.tile(numpy.arange(nSeqs), (nStates, 1)).T)
    dummyColumns = variable(-numpy.inf * torch.ones((nSeqs, maxConcur)))
    uttLogProb = variable(torch.zeros(nSeqs))
    for frame in range(nFrames + 1):    # +1 because we are considering boundaries
        # Case 0: don't emit anything at current frame
        p = alpha + blankLogProb[:, frame].view((-1, 1))
        alpha = p
        for i in range(1, maxConcur + 1):
            # Case i: emit i tokens at current frame
            p = p[:, :-1] + deltaLogProb[seqIndex[:, i:], frame, label[:, (i-1):]]
            alpha = logsumexp(alpha, torch.cat([dummyColumns[:, :i], p], dim = 1))
        # Collect probability for ends of utterances
        finishedSeqs = (seqLen == frame).nonzero()[0]
        if len(finishedSeqs) > 0:
            finishedSeqs = tensor(finishedSeqs)
            uttLogProb[finishedSeqs] = alpha[finishedSeqs, labelLen[finishedSeqs]].clone()

    # Return the per-frame negative log probability of all utterances (and per-utterance log probs if debug == True)
    loss = -uttLogProb.sum() / (seqLen + 1).sum()
    if debug:
        return loss, uttLogProb
    else:
        return loss

if __name__ == '__main__':
    def strip(variable):
        return variable.data.cpu().numpy()

    torch.set_printoptions(precision = 5)
    frameProb = numpy.array([[[0.1, 0.9, 0.9],
                              [0.1, 0.9, 0.9],
                              [0.1, 0.9, 0.9],
                              [0.1, 0.9, 0.1]]], dtype = 'float32')
    # event B all the time; event C in the first three frames
    frameProb = numpy.tile(frameProb, (4, 1, 1))
    frameProb = Variable(tensor(frameProb), requires_grad = True)
    label = [[3, 5, 6, 4], [3, 4], [5, 6], []]    # [B-on, C-on, C-off, B-off]; [B-on, B-off]; [C-on, C-off]; empty
    seqLen = numpy.array([4, 4, 4, 4])
    loss, uttLogProb = ctl_loss(frameProb, seqLen, label, maxConcur = 1, debug = True)
    print strip(loss), strip(torch.exp(uttLogProb))
    loss, uttLogProb = ctl_loss(frameProb, seqLen, label, maxConcur = 2, debug = True)
    print strip(loss), strip(torch.exp(uttLogProb))
    loss, uttLogProb = ctl_loss(frameProb, seqLen, label, maxConcur = 3, debug = True)
    print strip(loss), strip(torch.exp(uttLogProb))
    # Reference output:
    # [ 1.45882034] [ 2.10689101e-03 2.61903927e-03 1.27433671e-03 3.03234774e-05]
    #   Prob of first label sequence is small
    # [ 1.26348567] [ 1.04593262e-01 2.61992868e-03 1.27623521e-03 3.03234774e-05]
    #   Prob of first label sequence gets big, because pairs of adjacent tokens can be emitted at the same time
    # [ 1.263484  ] [ 1.04596682e-01 2.61992868e-03 1.27623521e-03 3.03234774e-05]
    #   Prob of first label sequence stays almost the same, because it doesn't need to emit three tokens at the same time
    loss.backward()
    print frameProb.grad


================================================
FILE: code/sequential/eval.py
================================================
import sys, os, os.path
import argparse
import numpy
from util_out import *
from util_f1 import *
from scipy.io import loadmat, savemat

# Parse input arguments
def mybool(s):
    return s.lower() in ['t', 'true', 'y', 'yes', '1']
parser = argparse.ArgumentParser()
parser.add_argument('--mode', type = str, default = 'ctl', choices = ['strong', 'mil', 'ctc', 'ctl', 'combine'])
parser.add_argument('--embedding_size', type = int, default = 512)
    # This is the embedding size after a pooling layer or after the GRU layer
    # After a non-pooling layer, the embedding size will be twice this much
parser.add_argument('--n_conv_layers', type = int, default = 6)
parser.add_argument('--kernel_size', type = str, default = '3')    # 'n' or 'nxm'
parser.add_argument('--n_pool_layers', type = int, default = 6)
    # the pooling layers will be inserted uniformly into the conv layers
    # there should be at least 2 and at most 6 pooling layers
    # the first two pooling layers will have stride (2,2); later ones will have stride (1,2)
parser.add_argument('--max_concur', type = int, default = 1)
parser.add_argument('--mil_weight', type = float, default = 3.3)
parser.add_argument('--ctl_weight', type = float, default = 1.0)
parser.add_argument('--batch_norm', type = mybool, default = True)
parser.add_argument('--dropout', type = float, default = 0.0)
parser.add_argument('--batch_size', type = int, default = 500)
parser.add_argument('--ckpt_size', type = int, default = 200)    # how many batches per checkpoint
parser.add_argument('--optimizer', type = str, default = 'adam', choices = ['adam', 'sgd'])
parser.add_argument('--init_lr', type = float, default = 1e-3)
parser.add_argument('--lr_patience', type = int, default = 3)
parser.add_argument('--lr_factor', type = float, default = 1.0)
parser.add_argument('--random_seed', type = int, default = 15213)
parser.add_argument('--ckpt', type = int)
args = parser.parse_args()
if 'x' not in args.kernel_size:
    args.kernel_size = args.kernel_size + 'x' + args.kernel_size

# Locate model file and prepare directories for prediction and evaluation
expid = '%s-embed%d-%dC%dP-kernel%s%s%s-%s-drop%.1f-batch%d-ckpt%d-%s-lr%.0e-pat%d-fac%.1f-seed%d' % (
    args.mode, args.embedding_size, args.n_conv_layers, args.n_pool_layers, args.kernel_size,
    '-concur%d' % args.max_concur if args.mode in ['ctl', 'combine'] else '',
    '-weight%g:%g' % (args.mil_weight, args.ctl_weight) if args.mode == 'combine' else '',
    'bn' if args.batch_norm else 'nobn',
    args.dropout, args.batch_size, args.ckpt_size, args.optimizer, args.init_lr, args.lr_patience, args.lr_factor, args.random_seed
)
WORKSPACE = os.path.join('../../workspace/sequential', expid)
MODEL_FILE = os.path.join(WORKSPACE, 'model', 'checkpoint%d.pt' % args.ckpt)
PRED_PATH = os.path.join(WORKSPACE, 'pred')
if not os.path.exists(PRED_PATH):
    os.makedirs(PRED_PATH)
PRED_FILE = os.path.join(PRED_PATH, 'checkpoint%d.mat' % args.ckpt)
EVAL_PATH = os.path.join(WORKSPACE, 'eval')
if not os.path.exists(EVAL_PATH):
    os.makedirs(EVAL_PATH)
EVAL_FILE = os.path.join(EVAL_PATH, 'checkpoint%d.txt' % args.ckpt)
with open(EVAL_FILE, 'w'):
    pass

def write_log(s):
    print s
    with open(EVAL_FILE, 'a') as f:
        f.write(s + '\n')

if os.path.exists(PRED_FILE):
    # Load saved predictions, no need to use GPU
    data = loadmat(PRED_FILE)
    thres = data['thres'].ravel()
    eval_frame_y = data['eval_frame_y']
    eval_frame_prob = data['eval_frame_prob']
else:
    import torch
    import torch.nn as nn
    from torch.optim import *
    from torch.optim.lr_scheduler import *
    from torch.autograd import Variable
    from Net import Net
    from util_in import *

    # Load model
    args.kernel_size = tuple(int(x) for x in args.kernel_size.split('x'))
    model = Net(args).cuda()
    model.load_state_dict(torch.load(MODEL_FILE)['model'])
    model.eval()

    # Load data
    valid_x, valid_frame_y, _, _ = bulk_load('GAS_valid')
    eval_x, eval_frame_y, _, eval_hashes = bulk_load('GAS_eval')

    # Predict
    if args.mode == 'ctc':
        thres = numpy.array([0.5] * eval_frame_y.shape[-1])
        eval_log_prob = model.predict(eval_x)
        eval_frame_prob = ctc_decode(eval_log_prob).astype('float32')
    else:
        valid_frame_prob = model.predict(valid_x)
        thres, valid_f1 = optimize_gas_valid(valid_frame_prob, valid_frame_y)
        eval_frame_prob = model.predict(eval_x)

    # Save predictions
    data = {}
    data['thres'] = thres
    data['eval_hashes'] = eval_hashes
    data['eval_frame_y'] = eval_frame_y
    data['eval_frame_prob'] = eval_frame_prob
    if args.mode == 'ctc':
        data['eval_log_prob'] = eval_log_prob
    savemat(PRED_FILE, data)

# Evaluation
write_log('   CLASS   ||  THRES   ||   TP  |   FN  |   FP  | Prec.  | Recall |   F1   ')
FORMAT1 = ' Macro Avg ||          ||       |       |       |        |        | %6.02f '
FORMAT2 = ' %######9d || %8.0006f || %##5d | %##5d | %##5d | %6.02f | %6.02f | %6.02f '
SEP = ''.join('+' if c == '|' else '-' for c in FORMAT1)
write_log(SEP)
TP, FN, FP, precision, recall, f1 = evaluate_gas_eval(eval_frame_prob, thres, eval_frame_y, verbose = True)
write_log(FORMAT1 % f1.mean())
write_log(SEP)
N_CLASSES = len(f1)
for i in range(N_CLASSES):
    write_log(FORMAT2 % (i, thres[i], TP[i], FN[i], FP[i], precision[i], recall[i], f1[i]))


================================================
FILE: code/sequential/train.py
================================================
import sys, os, os.path, time
import argparse
import numpy
import torch
import torch.nn as nn
from torch.optim import *
from torch.optim.lr_scheduler import *
from torch.autograd import Variable
from Net import Net
from ctc import ctc_loss
from ctl import ctl_loss
from util_in import *
from util_out import *
from util_f1 import *
torch.backends.cudnn.benchmark = True

# Parse input arguments
def mybool(s):
    return s.lower() in ['t', 'true', 'y', 'yes', '1']
parser = argparse.ArgumentParser()
parser.add_argument('--mode', type = str, default = 'ctl', choices = ['strong', 'mil', 'ctc', 'ctl', 'combine'])
parser.add_argument('--embedding_size', type = int, default = 512)
    # This is the embedding size after a pooling layer or after the GRU layer
    # After a non-pooling layer, the embedding size will be twice this much
parser.add_argument('--n_conv_layers', type = int, default = 6)
parser.add_argument('--kernel_size', type = str, default = '3')    # 'n' or 'nxm'
parser.add_argument('--n_pool_layers', type = int, default = 6)
    # the pooling layers will be inserted uniformly into the conv layers
    # there should be at least 2 and at most 6 pooling layers
    # the first two pooling layers will have stride (2,2); later ones will have stride (1,2)
parser.add_argument('--max_concur', type = int, default = 1)      # for mode == 'ctl' or 'combine' only
parser.add_argument('--mil_weight', type = float, default = 3.3)  # for mode == 'combine' only
parser.add_argument('--ctl_weight', type = float, default = 1.0)  # for mode == 'combine' only
parser.add_argument('--batch_norm', type = mybool, default = True)
parser.add_argument('--dropout', type = float, default = 0.0)
parser.add_argument('--batch_size', type = int, default = 500)
parser.add_argument('--ckpt_size', type = int, default = 200)     # how many batches per checkpoint
parser.add_argument('--optimizer', type = str, default = 'adam', choices = ['adam', 'sgd'])
parser.add_argument('--init_lr', type = float, default = 1e-3)
parser.add_argument('--lr_patience', type = int, default = 3)
parser.add_argument('--lr_factor', type = float, default = 1.0)
parser.add_argument('--max_ckpt', type = int, default = 100)
parser.add_argument('--random_seed', type = int, default = 15213)
args = parser.parse_args()
if 'x' not in args.kernel_size:
    args.kernel_size = args.kernel_size + 'x' + args.kernel_size
numpy.random.seed(args.random_seed)

# Prepare log file and model directory
expid = '%s-embed%d-%dC%dP-kernel%s%s%s-%s-drop%.1f-batch%d-ckpt%d-%s-lr%.0e-pat%d-fac%.1f-seed%d' % (
    args.mode, args.embedding_size, args.n_conv_layers, args.n_pool_layers, args.kernel_size,
    '-concur%d' % args.max_concur if args.mode in ['ctl', 'combine'] else '',
    '-weight%g:%g' % (args.mil_weight, args.ctl_weight) if args.mode == 'combine' else '',
    'bn' if args.batch_norm else 'nobn',
    args.dropout, args.batch_size, args.ckpt_size, args.optimizer, args.init_lr, args.lr_patience, args.lr_factor, args.random_seed
)
WORKSPACE = os.path.join('../../workspace/sequential', expid)
MODEL_PATH = os.path.join(WORKSPACE, 'model')
if not os.path.exists(MODEL_PATH):
    os.makedirs(MODEL_PATH)
LOG_FILE = os.path.join(WORKSPACE, 'train.log')
with open(LOG_FILE, 'w'):
    pass

def write_log(s):
    timestamp = time.strftime('%m-%d %H:%M:%S')
    msg = '[' + timestamp + '] ' + s
    print msg
    with open(LOG_FILE, 'a') as f:
        f.write(msg + '\n')

# Load data
write_log('Loading data ...')
train_gen = batch_generator(batch_size = args.batch_size, random_seed = args.random_seed)
gas_valid_x, gas_valid_y_frame, gas_valid_y_seq, _ = bulk_load('GAS_valid')
gas_eval_x, gas_eval_y_frame, gas_eval_y_seq, _ = bulk_load('GAS_eval')

# Build model
args.kernel_size = tuple(int(x) for x in args.kernel_size.split('x'))
model = Net(args).cuda()
if args.optimizer == 'sgd':
    optimizer = SGD(model.parameters(), lr = args.init_lr, momentum = 0.9, nesterov = True)
elif args.optimizer == 'adam':
    optimizer = Adam(model.parameters(), lr = args.init_lr)
scheduler = ReduceLROnPlateau(optimizer, mode = 'max', factor = args.lr_factor, patience = args.lr_patience) if args.lr_factor < 1.0 else None

# Train model
write_log('Training model ...')
write_log(' CKPT |    LR    | Tr.LOSS  || G.Val.F1 |  G.Ev.F1 ')
FORMAT = ' %#4d | %8.0003g | %8.0006f || %8.0002f | %8.0002f '
SEP = ''.join('+' if c == '|' else '-' for c in FORMAT)
write_log(SEP)
checkpoint = 0
best_gv_f1 = None
best_ge_f1 = None
bce_loss = nn.BCELoss()
for checkpoint in range(1, args.max_ckpt + 1):
    # Train for args.ckpt_size batches
    model.train()
    train_loss = 0
    for batch in range(1, args.ckpt_size + 1):
        x, y_global, y_seq, y_frame = next(train_gen)
        optimizer.zero_grad()
        if args.mode == 'strong':
            frame_prob = model(x)
            loss = bce_loss(frame_prob, y_frame)
        elif args.mode == 'mil':
            frame_prob = model(x)
            global_prob = (frame_prob * frame_prob).sum(dim = 1) / frame_prob.sum(dim = 1)    # linear softmax pooling function
            loss = bce_loss(global_prob, y_global)
        elif args.mode == 'ctc':
            log_prob = model(x)
            seq_len = numpy.array([log_prob.shape[1]] * log_prob.shape[0])    # actually all sequences in a batch have the same length
            loss = ctc_loss(log_prob, seq_len, y_seq)
        elif args.mode == 'ctl':
            frame_prob = model(x)
            seq_len = numpy.array([frame_prob.shape[1]] * frame_prob.shape[0])    # actually all sequences in a batch have the same length
            loss = ctl_loss(frame_prob, seq_len, y_seq, args.max_concur)
        elif args.mode == 'combine':
            frame_prob = model(x)
            global_prob = (frame_prob * frame_prob).sum(dim = 1) / frame_prob.sum(dim = 1)    # linear softmax pooling function
            mil_loss = bce_loss(global_prob, y_global)
            seq_len = numpy.array([frame_prob.shape[1]] * frame_prob.shape[0])    # actually all sequences in a batch have the same length
            ctl_loss_ = ctl_loss(frame_prob, seq_len, y_seq, args.max_concur)
            loss = mil_loss * args.mil_weight + ctl_loss_ * args.ctl_weight
        train_loss += loss.data[0]
        if numpy.isnan(train_loss) or numpy.isinf(train_loss):
            break
        loss.backward()
        optimizer.step()
        sys.stderr.write('Checkpoint %d, Batch %d / %d, avg train loss = %f\r' % \
                         (checkpoint, batch, args.ckpt_size, train_loss / batch))
    train_loss /= args.ckpt_size

    # Evaluate model
    model.eval()
    def predict(x):
        if args.mode != 'ctc':
            return model.predict(x)
        else:
            log_prob = model.predict(x)
            return ctc_decode(log_prob).astype('float32')
    sys.stderr.write('Evaluating model on GAS_VALID ...\r')
    frame_prob = predict(gas_valid_x)
    thres, gv_f1 = optimize_gas_valid(frame_prob, gas_valid_y_frame)
    sys.stderr.write('Evaluating model on GAS_EVAL ...\r')
    frame_prob = predict(gas_eval_x)
    ge_f1 = evaluate_gas_eval(frame_prob, thres, gas_eval_y_frame, verbose = False)

    # Write log
    write_log(FORMAT % (checkpoint, optimizer.param_groups[0]['lr'], train_loss, gv_f1, ge_f1))

    # Abort if training has gone mad
    if numpy.isnan(train_loss) or numpy.isinf(train_loss):
        write_log('Aborted.')
        break

    # Save model regularly. Too bad I can't save the scheduler
    MODEL_FILE = os.path.join(MODEL_PATH, 'checkpoint%d.pt' % checkpoint)
    state = {'model': model.state_dict(), 'optimizer': optimizer.state_dict()}
    sys.stderr.write('Saving model to %s ...\r' % MODEL_FILE)
    torch.save(state, MODEL_FILE)

    # Update learning rate
    if scheduler is not None:
        scheduler.step(gv_f1)

    # Update best results
    if best_gv_f1 is None or gv_f1 > best_gv_f1:
        best_gv_f1 = gv_f1
        best_gv_ckpt = checkpoint
    if best_ge_f1 is None or ge_f1 > best_ge_f1:
        best_ge_f1 = ge_f1
        best_ge_ckpt = checkpoint

write_log('DONE!')


================================================
FILE: code/sequential/util_f1.py
================================================
import numpy

# Compute F1 given predictions and truth
def f1(pred, truth):
    return 2.0 * (pred & truth).sum() / (pred.sum() + truth.sum())

# Given scores and truth for a single class (as 1-D numpy arrays), find optimal threshold and corresponding F1
# Statistics of other classes may be given to optimize micro-average F1
def optimize_f1(scores, truth, extraNcorr = 0, extraNtrue = 0, extraNpred = 0):
    # Start with predicting everything as negative
    best_thres = numpy.inf
    best_f1 = 0.0
    num = extraNcorr                               # number of correctly predicted instances
    den = extraNtrue + extraNpred + truth.sum()    # number of predicted instances + true instances
    instances = [(-numpy.inf, False)] + sorted(zip(scores, truth))
    # Lower the threshold gradually
    for i in range(len(instances) - 1, 0, -1):
        if instances[i][1]:
            num += 1
        den += 1
        if instances[i][0] > \
           instances[i-1][0]:    # Can put threshold here
            f1 = 2.0 * num / den
            if f1 > best_f1:
                best_thres = (instances[i][0] + instances[i-1][0]) / 2
                best_f1 = f1
    return best_thres, best_f1

# Given scores and truth for many classes (as 2-D numpy arrays),
# find the optimal class-specific thresholds (as a 1-D numpy array) that maximize the micro-average F1
# The algorithm is stochastic, but I have always observed deterministic results
def optimize_micro_avg_f1(scores, truth):
    # First optimize each class individually
    nClasses = truth.shape[1]
    thres = numpy.zeros(nClasses, dtype = 'float64')
    for i in range(nClasses):
        thres[i], _ = optimize_f1(scores[:,i], truth[:,i])
    Ntrue = truth.sum(axis = 0)
    Npred = (scores >= thres).sum(axis = 0)
    Ncorr = ((scores >= thres) & truth).sum(axis = 0)
    # Repeatedly re-tune the threshold for each class until convergence
    candidates = range(nClasses)
    while len(candidates) > 0:
        i = numpy.random.choice(candidates)
        candidates.remove(i)
        old_thres = thres[i]
        thres[i], _ = optimize_f1(
            scores[:,i], truth[:,i],
            extraNcorr = Ncorr.sum() - Ncorr[i],
            extraNtrue = Ntrue.sum() - Ntrue[i],
            extraNpred = Npred.sum() - Npred[i],
        )
        if thres[i] != old_thres:
            Npred[i] = (scores[:,i] >= thres[i]).sum(axis = 0)
            Ncorr[i] = ((scores[:,i] >= thres[i]) & truth[:,i]).sum(axis = 0)
            candidates = range(nClasses)
            candidates.remove(i)
    return thres


================================================
FILE: code/sequential/util_in.py
================================================
import sys, os, os.path, glob
import cPickle
from scipy.io import loadmat
import numpy
from multiprocessing import Process, Queue
import torch
from torch.autograd import Variable

N_CLASSES = 35
N_WORKERS = 6
FEATURE_DIR = '../../data/sequential'
with open(os.path.join(FEATURE_DIR, 'normalizer.pkl'), 'rb') as f:
    mu, sigma = cPickle.load(f)

def sample_generator(file_list, random_seed = 15213):
    rng = numpy.random.RandomState(random_seed)
    while True:
        rng.shuffle(file_list)
        for filename in file_list:
            data = loadmat(filename)
            feat = ((data['feat'] - mu) / sigma).astype('float32')
            labels = data['labels'].astype('bool')
            for i in range(len(data['feat'])):
                yield feat[i], labels[i]

def worker(queues, file_lists, random_seed):
    generators = [sample_generator(file_lists[i], random_seed + i) for i in range(len(file_lists))]
    while True:
        for gen, q in zip(generators, queues):
            q.put(next(gen))

def batch_generator(batch_size, random_seed = 15213):
    queues = [Queue(5) for class_id in range(N_CLASSES)]
    file_lists = [sorted(glob.glob(os.path.join(FEATURE_DIR, 'GAS_train_unbalanced_class%02d_part*.mat' % class_id))) for class_id in range(N_CLASSES)]
    for worker_id in range(N_WORKERS):
        p = Process(target = worker, args = (queues[worker_id::N_WORKERS], file_lists[worker_id::N_WORKERS], random_seed))
        p.daemon = True
        p.start()
    rng = numpy.random.RandomState(random_seed)
    batch_x = []; batch_y_global = []; batch_y_seq = []; batch_y_frame = []
    while True:
        rng.shuffle(queues)
        for q in queues:
            x, y_frame = q.get()
            batch_x.append(x)
            batch_y_global.append(y_frame.max(axis = -2))
            batch_y_seq.append(mask2ctc(y_frame))
            batch_y_frame.append(y_frame)
            if len(batch_x) == batch_size:
                yield Variable(torch.from_numpy(numpy.stack(batch_x))).cuda(), \
                      Variable(torch.from_numpy(numpy.stack(batch_y_global).astype('float32'))).cuda(), \
                      batch_y_seq, \
                      Variable(torch.from_numpy(numpy.stack(batch_y_frame).astype('float32'))).cuda()
                batch_x = []; batch_y_global = []; batch_y_seq = []; batch_y_frame = []

def bulk_load(prefix):
    data = loadmat(os.path.join(FEATURE_DIR, prefix + '.mat'))
    x = ((data['feat'] - mu) / sigma).astype('float32')
    y_frame = data['labels'].astype('bool')
    y_seq = [mask2ctc(y) for y in y_frame]
    return x, y_frame, y_seq, data['hashes']

def mask2ctc(mask):
    z = numpy.zeros((1, mask.shape[-1]), dtype = 'bool')
    zp = numpy.concatenate([z, mask])
    pz = numpy.concatenate([mask, z])
    onset = (pz & ~zp).nonzero()
    offset = (zp & ~pz).nonzero()
    boundaries = sorted([(t, 1, event) for (t, event) in zip(*onset)] +
                        [(t, -1, event) for (t, event) in zip(*offset)])    # (time, onset/offset, event id)
    return [bound[2] * 2 + {1: 1, -1: 2}[bound[1]] for bound in boundaries]


================================================
FILE: code/sequential/util_out.py
================================================
import numpy
from util_f1 import *
from joblib import Parallel, delayed

N_JOBS = 6

def ctc_decode(log_prob):
    # Decode log_prob (boundary probabilities, batch * frame * (2n+1)) to frame_pred (boolean event decisions, batch * frame * n)
    nSeqs, nFrames, nLabels = log_prob.shape
    nClasses = (nLabels - 1) / 2
    frame_pred = numpy.zeros((nSeqs, nFrames, nClasses), dtype = 'bool')
    for i in range(nSeqs):
        onset = [None] * nClasses
        prev_token = 0
        for t, token in zip(range(nFrames), log_prob[i].argmax(axis = 1)):
            if token == 0:
                continue
            if token % 2 == 1:
                # onset of event
                event = (token - 1) / 2
                onset[event] = t
            else:
                # offset of event
                event = token / 2 - 1
                if onset[event] is not None:
                    frame_pred[i, onset[event] : t + 1, event] = True
                    onset[event] = None
    return frame_pred

def optimize_gas_valid(pred, y):
    nClasses = y.shape[-1]
    result = Parallel(n_jobs = N_JOBS)(delayed(optimize_f1)(pred[..., i].ravel(), y[..., i].ravel()) for i in range(nClasses))
    thres = numpy.array([r[0] for r in result], dtype = 'float64')
    class_f1 = numpy.array([r[1] for r in result], dtype = 'float32') * 100.0
    return thres, class_f1.mean()

def TP_FN_FP(pred, truth):
    TP = (pred & truth).sum()
    FN = (~pred & truth).sum()
    FP = (pred & ~truth).sum()
    return (TP, FN, FP)

def evaluate_gas_eval(pred, thres, truth, verbose = False):
    # if verbose == False, return only the macro-average F1
    # if verbose == True, return the class-wise TP, FN, FP, precision, recall, F1
    pred = pred >= thres
    nClasses = len(thres)
    stats = Parallel(n_jobs = N_JOBS)(delayed(TP_FN_FP)(pred[..., i], truth[..., i]) for i in range(nClasses))
    TP, FN, FP = numpy.array(stats, dtype = 'int32').T
    f1 = 200.0 * TP / (2 * TP + FN + FP)
    if not verbose:
        return f1.mean()
    precision = 100.0 * TP / (TP + FP)
    recall = 100.0 * TP / (TP + FN)
    return TP, FN, FP, precision, recall, f1


================================================
FILE: data/download.sh
================================================
archives="audioset.tgz sequential.tgz dcase.tgz"
for archive in $archives; do
    wget http://islpc21.is.cs.cmu.edu/yunwang/git/cmu-thesis/data/$archive && ((tar zxf $archive && rm $archive) &)
done
while [ $(ls $archives 2>/dev/null | wc -l) -ne 0 ]; do
    echo -ne "Extracting file $(ls ${archives//.tgz/\/*} 2>/dev/null | wc -l) of 47457 ...\r"
    sleep 10
done
echo -e "\nAll files extracted. DONE!"


================================================
FILE: workspace/.gitignore
================================================
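
The label sequences consumed by `ctc_loss` and `ctl_loss` are produced by `mask2ctc` in `util_in.py`, which turns a frames-by-classes activity mask into a sequence of boundary tokens: for event k (0-based), an onset emits token 2k+1 and an offset emits token 2k+2, with token 0 reserved for the CTC blank (`ctc_decode` in `util_out.py` inverts this mapping). A minimal pure-Python sketch of that encoding, for reading alongside the repo code — the name `mask2tokens` and the list-of-lists input are illustrative, not part of the repo:

```python
def mask2tokens(mask):
    """Encode a frames x classes boolean activity matrix as boundary tokens.
    For event k (0-based): onset -> token 2k+1, offset -> token 2k+2;
    token 0 is left for the CTC blank. An offset is recorded at the frame
    index just past the last active frame, matching mask2ctc's zp/pz shift."""
    n_frames = len(mask)
    n_classes = len(mask[0]) if n_frames else 0
    boundaries = []
    for e in range(n_classes):
        prev = False
        for t in range(n_frames + 1):        # one extra step closes events still active at the end
            cur = bool(mask[t][e]) if t < n_frames else False
            if cur and not prev:
                boundaries.append((t, 1, e))     # onset of event e before frame t
            elif prev and not cur:
                boundaries.append((t, -1, e))    # offset of event e before frame t
            prev = cur
    boundaries.sort()                        # order tokens by time, onsets before offsets at ties
    return [e * 2 + (1 if kind == 1 else 2) for (t, kind, e) in boundaries]
```

On the four-frame, three-class example from the self-test in `ctl.py` (event B active in all frames, event C in the first three), this encoding yields `[3, 5, 6, 4]` — the first label sequence in that test; `ctl_loss` then subtracts one from each token because it has no dedicated blank.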