Repository: jayg996/BTC-ISMIR19
Branch: master
Commit: 2682317be668
Files: 21
Total size: 23.4 MB
Directory structure:
gitextract_upyc_iog/
├── LICENSE
├── README.md
├── audio_dataset.py
├── baseline_models.py
├── btc_model.py
├── crf_model.py
├── run_config.yaml
├── test/
│ ├── btc_model.pt
│ └── btc_model_large_voca.pt
├── test.py
├── train.py
├── train_crf.py
└── utils/
    ├── __init__.py
    ├── chords.py
    ├── hparams.py
    ├── logger.py
    ├── mir_eval_modules.py
    ├── preprocess.py
    ├── pytorch_utils.py
    ├── tf_logger.py
    └── transformer_modules.py
================================================
FILE CONTENTS
================================================
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2019 Jonggwon Park
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# A Bi-Directional Transformer for Musical Chord Recognition
This repository contains the source code for the paper "A Bi-Directional Transformer for Musical Chord Recognition" (ISMIR 2019).
<img src="png/model.png">
## Requirements
- pytorch >= 1.0.0
- numpy >= 1.16.2
- pandas >= 0.24.1
- pyrubberband >= 0.3.0
- librosa >= 0.6.3
- pyyaml >= 3.13
- mir_eval >= 0.5
- pretty_midi >= 0.2.8
## File descriptions
* `audio_dataset.py` : loads data, preprocessing label files into chord labels and mp3 files into constant-Q transform (CQT) features.
* `btc_model.py` : contains the PyTorch implementation of BTC.
* `train.py` : trains BTC.
* `crf_model.py` : contains the PyTorch implementation of Conditional Random Fields (CRFs).
* `baseline_models.py` : contains the code of the baseline models.
* `train_crf.py` : trains the CRFs.
* `run_config.yaml` : contains the required hyperparameters and paths.
* `test.py` : recognizes chords from audio files.
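As a rough sketch of the feature pipeline: each preprocessed instance stores a complex CQT matrix, which `audio_dataset.py` log-compresses on load in `AudioDataset.__getitem__`. The random matrix below is a hypothetical stand-in for a stored instance; the shapes come from `run_config.yaml` (`n_bins: 144`, `timestep: 108`).

```python
import numpy as np

# Hypothetical stand-in for a stored CQT instance:
# 144 frequency bins (n_bins) by 108 frames (timestep), complex-valued.
rng = np.random.default_rng(0)
cqt = rng.standard_normal((144, 108)) + 1j * rng.standard_normal((144, 108))

# AudioDataset.__getitem__ applies log compression to the magnitude;
# the 1e-6 floor avoids log(0) on silent frames.
feature = np.log(np.abs(cqt) + 1e-6)

print(feature.shape)  # (144, 108)
```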
## Using BTC : Recognizing chords from files in audio directory
### Using BTC from command line
```bash
$ python test.py --audio_dir audio_folder --save_dir save_folder --voca False
```
* `audio_dir` : a folder of audio files for chord recognition (default: './test')
* `save_dir` : a folder for saving the recognition results (default: './test')
* `voca` : False selects the major/minor label type, True the large-vocabulary label type (default: False)
The results are saved as `.lab` files in the format shown below, together with MIDI files.
<img src="png/example.png">
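Each line of a `.lab` file written by `test.py` is a `start end chord` triple (times in seconds, chords in mir_eval syntax). A minimal parser sketch, with illustrative labels that are not taken from the repository:

```python
# Parse chord .lab lines of the form "start end chord",
# e.g. "0.000 2.321 C:maj" (times in seconds).
def parse_lab(lines):
    intervals = []
    for line in lines:
        start, end, chord = line.split()
        intervals.append((float(start), float(end), chord))
    return intervals

example = ["0.000 2.321 C:maj\n", "2.321 4.000 G:maj\n"]
print(parse_lab(example))  # → [(0.0, 2.321, 'C:maj'), (2.321, 4.0, 'G:maj')]
```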
## Attention Map
The figures show the attention probabilities of self-attention layers 1, 3, 5, and 8, chosen because they best illustrate the distinct characteristics of the layers. The input audio is the song "Just A Girl"
(0m30s ~ 0m40s) by No Doubt from UsPop2002, which was part of the evaluation data.
<img src="png/attention.png">
## Data
We used the Isophonics[1], Robbie Williams[2], and UsPop2002[3] datasets, which consist of chord label files. Due to copyright issues, these datasets do not include audio files. The audio files used in this work were collected from online music service providers.
[1] http://isophonics.net/datasets
[2] B. Di Giorgi, M. Zanoni, A. Sarti, and S. Tubaro. Automatic chord recognition based on the probabilistic modeling of diatonic modal harmony. In Proc. of the 8th International Workshop on Multidimensional Systems, Erlangen, Germany, 2013.
[3] https://github.com/tmc323/Chord-Annotations
## Reference
* PyTorch implementations of the Transformer and CRF: https://github.com/kolloldas/torchnlp
## Comments
* Comments on the code are always welcome.
================================================
FILE: audio_dataset.py
================================================
import numpy as np
import os
import torch
from torch.utils.data import Dataset, DataLoader
from utils.preprocess import Preprocess, FeatureTypes
import math
from multiprocessing import Pool
from sortedcontainers import SortedList
class AudioDataset(Dataset):
def __init__(self, config, root_dir='/data/music/chord_recognition', dataset_names=('isophonic',),
featuretype=FeatureTypes.cqt, num_workers=20, train=False, preprocessing=False, resize=None, kfold=4):
super(AudioDataset, self).__init__()
self.config = config
self.root_dir = root_dir
self.dataset_names = dataset_names
self.preprocessor = Preprocess(config, featuretype, dataset_names, self.root_dir)
self.resize = resize
self.train = train
self.ratio = config.experiment['data_ratio']
# preprocessing hyperparameters
# song_hz, n_bins, bins_per_octave, hop_length
mp3_config = config.mp3
feature_config = config.feature
self.mp3_string = "%d_%.1f_%.1f" % \
(mp3_config['song_hz'], mp3_config['inst_len'],
mp3_config['skip_interval'])
self.feature_string = "%s_%d_%d_%d" % \
(featuretype.value, feature_config['n_bins'], feature_config['bins_per_octave'], feature_config['hop_length'])
if feature_config['large_voca']:
# store paths if exists
is_preprocessed = os.path.exists(os.path.join(root_dir, 'result', dataset_names[0]+'_voca', self.mp3_string, self.feature_string))
if (not is_preprocessed) or preprocessing:
midi_paths = self.preprocessor.get_all_files()
if num_workers > 1:
num_path_per_process = math.ceil(len(midi_paths) / num_workers)
args = [midi_paths[i * num_path_per_process:(i + 1) * num_path_per_process] for i in range(num_workers)]
# start process
p = Pool(processes=num_workers)
p.map(self.preprocessor.generate_labels_features_voca, args)
p.close()
else:
self.preprocessor.generate_labels_features_voca(midi_paths)
# kfold is 5 fold index ( 0, 1, 2, 3, 4 )
self.song_names, self.paths = self.get_paths_voca(kfold=kfold)
else:
# store paths if exists
is_preprocessed = os.path.exists(os.path.join(root_dir, 'result', dataset_names[0], self.mp3_string, self.feature_string))
if (not is_preprocessed) or preprocessing:
midi_paths = self.preprocessor.get_all_files()
if num_workers > 1:
num_path_per_process = math.ceil(len(midi_paths) / num_workers)
args = [midi_paths[i * num_path_per_process:(i + 1) * num_path_per_process]
for i in range(num_workers)]
# start process
p = Pool(processes=num_workers)
p.map(self.preprocessor.generate_labels_features_new, args)
p.close()
else:
self.preprocessor.generate_labels_features_new(midi_paths)
# kfold is 5 fold index ( 0, 1, 2, 3, 4 )
self.song_names, self.paths = self.get_paths(kfold=kfold)
def __len__(self):
return len(self.paths)
def __getitem__(self, idx):
instance_path = self.paths[idx]
res = dict()
data = torch.load(instance_path)
res['feature'] = np.log(np.abs(data['feature']) + 1e-6)
res['chord'] = data['chord']
return res
def get_paths(self, kfold=4):
temp = {}
used_song_names = list()
for name in self.dataset_names:
dataset_path = os.path.join(self.root_dir, "result", name, self.mp3_string, self.feature_string)
song_names = os.listdir(dataset_path)
for song_name in song_names:
paths = []
instance_names = os.listdir(os.path.join(dataset_path, song_name))
if len(instance_names) > 0:
used_song_names.append(song_name)
for instance_name in instance_names:
paths.append(os.path.join(dataset_path, song_name, instance_name))
temp[song_name] = paths
# throw away unused song names
song_names = used_song_names
song_names = SortedList(song_names)
print('Total used song length : %d' %len(song_names))
tmp = []
for i in range(len(song_names)):
tmp += temp[song_names[i]]
print('Total instances (train and valid) : %d' %len(tmp))
# divide train/valid dataset using k fold
result = []
total_fold = 5
quotient = len(song_names) // total_fold
remainder = len(song_names) % total_fold
fold_num = [0]
for i in range(total_fold):
fold_num.append(quotient)
for i in range(remainder):
fold_num[i+1] += 1
for i in range(total_fold):
fold_num[i+1] += fold_num[i]
if self.train:
tmp = []
# get not augmented data
for k in range(total_fold):
if k != kfold:
for i in range(fold_num[k], fold_num[k+1]):
result += temp[song_names[i]]
tmp += song_names[fold_num[k]:fold_num[k + 1]]
song_names = tmp
else:
for i in range(fold_num[kfold], fold_num[kfold+1]):
instances = temp[song_names[i]]
instances = [inst for inst in instances if "1.00_0" in inst]
result += instances
song_names = song_names[fold_num[kfold]:fold_num[kfold+1]]
return song_names, result
def get_paths_voca(self, kfold=4):
temp = {}
used_song_names = list()
for name in self.dataset_names:
dataset_path = os.path.join(self.root_dir, "result", name+'_voca', self.mp3_string, self.feature_string)
song_names = os.listdir(dataset_path)
for song_name in song_names:
paths = []
instance_names = os.listdir(os.path.join(dataset_path, song_name))
if len(instance_names) > 0:
used_song_names.append(song_name)
for instance_name in instance_names:
paths.append(os.path.join(dataset_path, song_name, instance_name))
temp[song_name] = paths
# throw away unused song names
song_names = used_song_names
song_names = SortedList(song_names)
print('Total used song length : %d' %len(song_names))
tmp = []
for i in range(len(song_names)):
tmp += temp[song_names[i]]
print('Total instances (train and valid) : %d' %len(tmp))
# divide train/valid dataset using k fold
result = []
total_fold = 5
quotient = len(song_names) // total_fold
remainder = len(song_names) % total_fold
fold_num = [0]
for i in range(total_fold):
fold_num.append(quotient)
for i in range(remainder):
fold_num[i+1] += 1
for i in range(total_fold):
fold_num[i+1] += fold_num[i]
if self.train:
tmp = []
# get not augmented data
for k in range(total_fold):
if k != kfold:
for i in range(fold_num[k], fold_num[k+1]):
result += temp[song_names[i]]
tmp += song_names[fold_num[k]:fold_num[k + 1]]
song_names = tmp
else:
for i in range(fold_num[kfold], fold_num[kfold+1]):
instances = temp[song_names[i]]
instances = [inst for inst in instances if "1.00_0" in inst]
result += instances
song_names = song_names[fold_num[kfold]:fold_num[kfold+1]]
return song_names, result
def _collate_fn(batch):
batch_size = len(batch)
max_len = batch[0]['feature'].shape[1]
input_percentages = torch.empty(batch_size) # for variable length
chord_lens = torch.empty(batch_size, dtype=torch.int64)
chords = []
collapsed_chords = []
features = []
boundaries = []
for i in range(batch_size):
sample = batch[i]
feature = sample['feature']
chord = sample['chord']
diff = np.diff(chord, axis=0).astype(bool)  # np.bool alias was removed in NumPy 1.24
idx = np.insert(diff, 0, True, axis=0)
chord_lens[i] = np.sum(idx).item(0)
chords.extend(chord)
features.append(feature)
input_percentages[i] = feature.shape[1] / max_len
collapsed_chords.extend(np.array(chord)[idx].tolist())
boundary = np.append([0], diff)
boundaries.extend(boundary.tolist())
features = torch.tensor(np.array(features), dtype=torch.float32).unsqueeze(1) # batch_size*1*feature_size*max_len
chords = torch.tensor(chords, dtype=torch.int64) # (batch_size*time_length)
collapsed_chords = torch.tensor(collapsed_chords, dtype=torch.int64) # total_unique_chord_len
boundaries = torch.tensor(boundaries, dtype=torch.uint8) # (batch_size*time_length)
return features, input_percentages, chords, collapsed_chords, chord_lens, boundaries
class AudioDataLoader(DataLoader):
def __init__(self, *args, **kwargs):
super(AudioDataLoader, self).__init__(*args, **kwargs)
self.collate_fn = _collate_fn
================================================
FILE: baseline_models.py
================================================
from utils.hparams import HParams
import torch
import torch.nn as nn
import torch.nn.functional as F
import time
from crf_model import CRF
use_cuda = torch.cuda.is_available()
class CNN(nn.Module):
def __init__(self,config):
super(CNN, self).__init__()
self.timestep = config['timestep']
self.context = 7
self.pad = nn.ConstantPad1d(self.context, 0)
self.probs_out = config['probs_out']
self.num_chords = config['num_chords']
self.drop_out = nn.Dropout2d(p=0.5)
self.conv1 = self.cnn_layers(1, 32, kernel_size=(3,3), padding=1)
self.conv2 = self.cnn_layers(32, 32, kernel_size=(3,3), padding=1)
self.conv3 = self.cnn_layers(32, 32, kernel_size=(3,3), padding=1)
self.conv4 = self.cnn_layers(32, 32, kernel_size=(3,3), padding=1)
self.pool_max = nn.MaxPool2d(kernel_size=(2,1))
self.conv5 = self.cnn_layers(32, 64, kernel_size=(3, 3), padding=0)
self.conv6 = self.cnn_layers(64, 64, kernel_size=(3, 3), padding=0)
self.conv7 = self.cnn_layers(64, 128, kernel_size=(12, 9), padding=0)
self.conv_linear = nn.Conv2d(128, config['num_chords'], kernel_size=(1,1), padding=0)
def cnn_layers(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
layers = []
conv2d = nn.Conv2d(in_channels, out_channels,kernel_size=kernel_size, stride=stride, padding=padding)
batch_norm = nn.BatchNorm2d(out_channels)
relu = nn.ReLU(inplace=True)
layers += [conv2d, batch_norm, relu]
return nn.Sequential(*layers)
def forward(self, x, labels):
x = x.permute(0,2,1)
x = self.pad(x)
batch_size = x.size(0)
for i in range(batch_size):
for j in range(self.timestep):
if i == 0 and j == 0:
inputs = x[i,:,j : j + self.context *2 + 1].unsqueeze(0)
else:
tmp = x[i, :, j : j + self.context *2 + 1].unsqueeze(0)
inputs = torch.cat((inputs,tmp), dim=0)
# inputs : [batchsize * timestep, feature_size, context]
inputs = inputs.unsqueeze(1)
conv = self.conv1(inputs)
conv = self.conv2(conv)
conv = self.conv3(conv)
conv = self.conv4(conv)
pooled = self.pool_max(conv)
pooled = self.drop_out(pooled)
conv = self.conv5(pooled)
conv = self.conv6(conv)
pooled = self.pool_max(conv)
pooled = self.drop_out(pooled)
conv = self.conv7(pooled)
conv = self.drop_out(conv)
conv = self.conv_linear(conv)
avg_pool = nn.AvgPool2d(kernel_size=(conv.size(2), conv.size(3)))
logits = avg_pool(conv).squeeze(2).squeeze(2)
if self.probs_out is True:
crf_input = logits.view(-1, self.timestep, self.num_chords)
return crf_input
log_probs = F.log_softmax(logits, -1)
topk, indices = torch.topk(log_probs, 2)
predictions = indices[:,0]
second = indices[:,1]
prediction = predictions.view(-1)
second = second.view(-1)
loss = F.nll_loss(log_probs.view(-1, self.num_chords), labels.view(-1))
return prediction, loss, 0, second
class Crf(nn.Module):
def __init__(self, num_chords, timestep):
super(Crf, self).__init__()
self.output_size = num_chords
self.timestep = timestep
self.Crf = CRF(self.output_size)
def forward(self, probs, labels):
prediction = self.Crf(probs)
prediction = prediction.view(-1)
labels = labels.view(-1, self.timestep)
loss = self.Crf.loss(probs, labels)
return prediction, loss
class CRNN(nn.Module):
def __init__(self,config):
super(CRNN, self).__init__()
self.feature_size = config['feature_size']
self.timestep = config['timestep']
self.probs_out = config['probs_out']
self.num_chords = config['num_chords']
self.hidden_size = 128
self.relu = nn.ReLU(inplace=True)
self.batch_norm = nn.BatchNorm2d(1)
self.conv1 = nn.Conv2d(1, 1, kernel_size=(5,5), padding=2)
self.conv2 = nn.Conv2d(1, 36, kernel_size=(1,self.feature_size))
self.gru = nn.GRU(input_size=36, hidden_size=self.hidden_size, num_layers=2, batch_first=True, bidirectional=True)
self.fc = nn.Linear(self.hidden_size*2, self.num_chords)
def forward(self, x, labels):
# x : [batchsize * timestep * feature_size]
x = x.unsqueeze(1)
x = self.batch_norm(x)
conv = self.relu(self.conv1(x))
conv = self.relu(self.conv2(conv))
conv = conv.squeeze(3).permute(0,2,1)
h0 = torch.zeros(4, conv.size(0), self.hidden_size).to(torch.device("cuda" if use_cuda else "cpu"))
gru, h = self.gru(conv, h0)
logits = self.fc(gru)
if self.probs_out is True:
# probs = F.softmax(logits, -1)
return logits
log_probs = F.log_softmax(logits, -1)
topk, indices = torch.topk(log_probs, 2)
predictions = indices[:,:,0]
second = indices[:,:,1]
prediction = predictions.view(-1)
second = second.view(-1)
loss = F.nll_loss(log_probs.view(-1, self.num_chords), labels.view(-1))
return prediction, loss, 0, second
if __name__ == "__main__":
config = HParams.load("run_config.yaml")
device = torch.device("cuda" if use_cuda else "cpu")
config.model['probs_out'] = True
batch_size = 2
timestep = config.model['timestep']
feature_size = config.model['feature_size']
num_chords = config.model['num_chords']
features = torch.randn(batch_size,timestep,feature_size,requires_grad=True).to(device)
chords = torch.randint(num_chords,(batch_size*timestep,)).to(device)
model = CNN(config=config.model).to(device)
crf = Crf(num_chords=config.model['num_chords'], timestep=config.model['timestep']).to(device)
probs = model(features, chords)
prediction, total_loss = crf(probs, chords)
print(total_loss)
================================================
FILE: btc_model.py
================================================
from utils.transformer_modules import *
from utils.transformer_modules import _gen_timing_signal, _gen_bias_mask
from utils.hparams import HParams
use_cuda = torch.cuda.is_available()
class self_attention_block(nn.Module):
def __init__(self, hidden_size, total_key_depth, total_value_depth, filter_size, num_heads,
bias_mask=None, layer_dropout=0.0, attention_dropout=0.0, relu_dropout=0.0, attention_map=False):
super(self_attention_block, self).__init__()
self.attention_map = attention_map
self.multi_head_attention = MultiHeadAttention(hidden_size, total_key_depth, total_value_depth,hidden_size, num_heads, bias_mask, attention_dropout, attention_map)
self.positionwise_convolution = PositionwiseFeedForward(hidden_size, filter_size, hidden_size, layer_config='cc', padding='both', dropout=relu_dropout)
self.dropout = nn.Dropout(layer_dropout)
self.layer_norm_mha = LayerNorm(hidden_size)
self.layer_norm_ffn = LayerNorm(hidden_size)
def forward(self, inputs):
x = inputs
# Layer Normalization
x_norm = self.layer_norm_mha(x)
# Multi-head attention
if self.attention_map is True:
y, weights = self.multi_head_attention(x_norm, x_norm, x_norm)
else:
y = self.multi_head_attention(x_norm, x_norm, x_norm)
# Dropout and residual
x = self.dropout(x + y)
# Layer Normalization
x_norm = self.layer_norm_ffn(x)
# Positionwise Feedforward
y = self.positionwise_convolution(x_norm)
# Dropout and residual
y = self.dropout(x + y)
if self.attention_map is True:
return y, weights
return y
class bi_directional_self_attention(nn.Module):
def __init__(self, hidden_size, total_key_depth, total_value_depth, filter_size, num_heads, max_length,
layer_dropout=0.0, attention_dropout=0.0, relu_dropout=0.0):
super(bi_directional_self_attention, self).__init__()
self.weights_list = list()
params = (hidden_size,
total_key_depth or hidden_size,
total_value_depth or hidden_size,
filter_size,
num_heads,
_gen_bias_mask(max_length),
layer_dropout,
attention_dropout,
relu_dropout,
True)
self.attn_block = self_attention_block(*params)
params = (hidden_size,
total_key_depth or hidden_size,
total_value_depth or hidden_size,
filter_size,
num_heads,
torch.transpose(_gen_bias_mask(max_length), dim0=2, dim1=3),
layer_dropout,
attention_dropout,
relu_dropout,
True)
self.backward_attn_block = self_attention_block(*params)
self.linear = nn.Linear(hidden_size*2, hidden_size)
def forward(self, inputs):
x, weights_list = inputs  # avoid shadowing the built-in list
# Forward Self-attention Block
encoder_outputs, weights = self.attn_block(x)
# Backward Self-attention Block
reverse_outputs, reverse_weights = self.backward_attn_block(x)
# Concatenation and Fully-connected Layer
outputs = torch.cat((encoder_outputs, reverse_outputs), dim=2)
y = self.linear(outputs)
# Attention weights for Visualization
self.weights_list = weights_list
self.weights_list.append(weights)
self.weights_list.append(reverse_weights)
return y, self.weights_list
class bi_directional_self_attention_layers(nn.Module):
def __init__(self, embedding_size, hidden_size, num_layers, num_heads, total_key_depth, total_value_depth,
filter_size, max_length=100, input_dropout=0.0, layer_dropout=0.0,
attention_dropout=0.0, relu_dropout=0.0):
super(bi_directional_self_attention_layers, self).__init__()
self.timing_signal = _gen_timing_signal(max_length, hidden_size)
params = (hidden_size,
total_key_depth or hidden_size,
total_value_depth or hidden_size,
filter_size,
num_heads,
max_length,
layer_dropout,
attention_dropout,
relu_dropout)
self.embedding_proj = nn.Linear(embedding_size, hidden_size, bias=False)
self.self_attn_layers = nn.Sequential(*[bi_directional_self_attention(*params) for l in range(num_layers)])
self.layer_norm = LayerNorm(hidden_size)
self.input_dropout = nn.Dropout(input_dropout)
def forward(self, inputs):
# Add input dropout
x = self.input_dropout(inputs)
# Project to hidden size
x = self.embedding_proj(x)
# Add timing signal
x += self.timing_signal[:, :inputs.shape[1], :].type_as(inputs.data)
# A Stack of Bi-directional Self-attention Layers
y, weights_list = self.self_attn_layers((x, []))
# Layer Normalization
y = self.layer_norm(y)
return y, weights_list
class BTC_model(nn.Module):
def __init__(self, config):
super(BTC_model, self).__init__()
self.timestep = config['timestep']
self.probs_out = config['probs_out']
params = (config['feature_size'],
config['hidden_size'],
config['num_layers'],
config['num_heads'],
config['total_key_depth'],
config['total_value_depth'],
config['filter_size'],
config['timestep'],
config['input_dropout'],
config['layer_dropout'],
config['attention_dropout'],
config['relu_dropout'])
self.self_attn_layers = bi_directional_self_attention_layers(*params)
self.output_layer = SoftmaxOutputLayer(hidden_size=config['hidden_size'], output_size=config['num_chords'], probs_out=config['probs_out'])
def forward(self, x, labels):
labels = labels.view(-1, self.timestep)
# Output of Bi-directional Self-attention Layers
self_attn_output, weights_list = self.self_attn_layers(x)
# return logit values for CRF
if self.probs_out is True:
logits = self.output_layer(self_attn_output)
return logits
# Output layer and Soft-max
prediction,second = self.output_layer(self_attn_output)
prediction = prediction.view(-1)
second = second.view(-1)
# Loss Calculation
loss = self.output_layer.loss(self_attn_output, labels)
return prediction, loss, weights_list, second
if __name__ == "__main__":
config = HParams.load("run_config.yaml")
device = torch.device("cuda" if use_cuda else "cpu")
batch_size = 2
timestep = 108
feature_size = 144
num_chords = 25
features = torch.randn(batch_size,timestep,feature_size,requires_grad=True).to(device)
chords = torch.randint(25,(batch_size*timestep,)).to(device)
model = BTC_model(config=config.model).to(device)
prediction, loss, weights_list, second = model(features, chords)
print(prediction.size())
print(loss)
================================================
FILE: crf_model.py
================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import torch
import torch.nn as nn
class CRF(nn.Module):
"""
Implements Conditional Random Fields that can be trained via
backpropagation.
"""
def __init__(self, num_tags):
super(CRF, self).__init__()
self.num_tags = num_tags
self.transitions = nn.Parameter(torch.Tensor(num_tags, num_tags))
self.start_transitions = nn.Parameter(torch.randn(num_tags))
self.stop_transitions = nn.Parameter(torch.randn(num_tags))
nn.init.xavier_normal_(self.transitions)
def forward(self, feats):
# Shape checks
if len(feats.shape) != 3:
raise ValueError("feats must be 3-d but got {}-d".format(len(feats.shape)))
return self._viterbi(feats)
def loss(self, feats, tags):
"""
Computes the negative log likelihood of tags given features:
the individual sequence score minus the log partition function
(log-sum-exp over all possible tag sequences).
Parameters:
feats: Input features [batch size, sequence length, number of tags]
tags: Target tag indices [batch size, sequence length]. Should be between
0 and num_tags
Returns:
Negative log likelihood [a scalar]
"""
# Shape checks
if len(feats.shape) != 3:
raise ValueError("feats must be 3-d but got {}-d".format(len(feats.shape)))
if len(tags.shape) != 2:
raise ValueError('tags must be 2-d but got {}-d'.format(len(tags.shape)))
if feats.shape[:2] != tags.shape:
raise ValueError('First two dimensions of feats and tags must match')
sequence_score = self._sequence_score(feats, tags)
partition_function = self._partition_function(feats)
log_probability = sequence_score - partition_function
# Negative log-likelihood, averaged across the batch
return -log_probability.mean()
def _sequence_score(self, feats, tags):
"""
Parameters:
feats: Input features [batch size, sequence length, number of tags]
tags: Target tag indices [batch size, sequence length]. Should be between
0 and num_tags
Returns: Sequence score of shape [batch size]
"""
batch_size = feats.shape[0]
# Compute feature scores
feat_score = feats.gather(2, tags.unsqueeze(-1)).squeeze(-1).sum(dim=-1)
# Compute transition scores
# Unfold to get [from, to] tag index pairs
tags_pairs = tags.unfold(1, 2, 1)
# Use advanced indexing to pull out required transition scores
indices = tags_pairs.permute(2, 0, 1).chunk(2)
trans_score = self.transitions[indices].squeeze(0).sum(dim=-1)
# Compute start and stop scores
start_score = self.start_transitions[tags[:, 0]]
stop_score = self.stop_transitions[tags[:, -1]]
return feat_score + start_score + trans_score + stop_score
def _partition_function(self, feats):
"""
Computes the partition function for the CRF using the forward
algorithm, i.e. the log-sum of scores over all possible tag
sequences for the given feature sequence.
Parameters:
feats: Input features [batch size, sequence length, number of tags]
Returns:
Total scores of shape [batch size]
"""
_, seq_size, num_tags = feats.shape
if self.num_tags != num_tags:
raise ValueError('num_tags should be {} but got {}'.format(self.num_tags, num_tags))
a = feats[:, 0] + self.start_transitions.unsqueeze(0) # [batch_size, num_tags]
transitions = self.transitions.unsqueeze(0) # [1, num_tags, num_tags] from -> to
for i in range(1, seq_size):
feat = feats[:, i].unsqueeze(1) # [batch_size, 1, num_tags]
a = self._log_sum_exp(a.unsqueeze(-1) + transitions + feat, 1) # [batch_size, num_tags]
return self._log_sum_exp(a + self.stop_transitions.unsqueeze(0), 1) # [batch_size]
def _viterbi(self, feats):
"""
Uses Viterbi algorithm to predict the best sequence
Parameters:
feats: Input features [batch size, sequence length, number of tags]
Returns: Best tag sequence [batch size, sequence length]
"""
_, seq_size, num_tags = feats.shape
if self.num_tags != num_tags:
raise ValueError('num_tags should be {} but got {}'.format(self.num_tags, num_tags))
v = feats[:, 0] + self.start_transitions.unsqueeze(0) # [batch_size, num_tags]
transitions = self.transitions.unsqueeze(0) # [1, num_tags, num_tags] from -> to
paths = []
for i in range(1, seq_size):
feat = feats[:, i] # [batch_size, num_tags]
v, idx = (v.unsqueeze(-1) + transitions).max(1) # [batch_size, num_tags], [batch_size, num_tags]
paths.append(idx)
v = (v + feat) # [batch_size, num_tags]
v, tag = (v + self.stop_transitions.unsqueeze(0)).max(1, True)
# Backtrack
tags = [tag]
for idx in reversed(paths):
tag = idx.gather(1, tag)
tags.append(tag)
tags.reverse()
return torch.cat(tags, 1)
def _log_sum_exp(self, logits, dim):
"""
Computes log-sum-exp in a stable way
"""
max_val, _ = logits.max(dim)
return max_val + (logits - max_val.unsqueeze(dim)).exp().sum(dim).log()
================================================
FILE: run_config.yaml
================================================
mp3:
song_hz: 22050
inst_len: 10.0
skip_interval: 5.0
feature:
n_bins: 144
bins_per_octave: 24
hop_length: 2048
large_voca: False
# large_voca: True
experiment:
learning_rate : 0.0001
weight_decay : 0.0
max_epoch : 100
batch_size : 128
save_step : 40
data_ratio : 0.8
model:
feature_size : 144
timestep : 108
num_chords : 25
# num_chords : 170
input_dropout : 0.2
layer_dropout : 0.2
attention_dropout : 0.2
relu_dropout : 0.2
num_layers : 8
num_heads : 4
hidden_size : 128
total_key_depth : 128
total_value_depth : 128
filter_size : 128
loss : 'ce'
probs_out : False
path:
ckpt_path : 'model'
result_path : 'result'
asset_path : '/data/music/chord_recognition/jayg996/assets'
root_path : '/data/music/chord_recognition'
================================================
FILE: test/btc_model.pt
================================================
[File too large to display: 11.6 MB]
================================================
FILE: test/btc_model_large_voca.pt
================================================
[File too large to display: 11.7 MB]
================================================
FILE: test.py
================================================
import os
import mir_eval
import pretty_midi as pm
from utils import logger
from btc_model import *
from utils.mir_eval_modules import audio_file_to_features, idx2chord, idx2voca_chord, get_audio_paths
import argparse
import warnings
warnings.filterwarnings('ignore')
logger.logging_verbosity(1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
# hyperparameters
parser = argparse.ArgumentParser()
parser.add_argument('--voca', default=True, type=lambda x: (str(x).lower() == 'true'))
parser.add_argument('--audio_dir', type=str, default='./test')
parser.add_argument('--save_dir', type=str, default='./test')
args = parser.parse_args()
config = HParams.load("run_config.yaml")
if args.voca is True:
config.feature['large_voca'] = True
config.model['num_chords'] = 170
model_file = './test/btc_model_large_voca.pt'
idx_to_chord = idx2voca_chord()
logger.info("label type: large voca")
else:
model_file = './test/btc_model.pt'
idx_to_chord = idx2chord
logger.info("label type: Major and minor")
model = BTC_model(config=config.model).to(device)
# Load model
if os.path.isfile(model_file):
checkpoint = torch.load(model_file, map_location=device)  # map_location lets CPU-only machines load a GPU-trained checkpoint
mean = checkpoint['mean']
std = checkpoint['std']
model.load_state_dict(checkpoint['model'])
logger.info("restore model")
# Audio files in wav or mp3 format
audio_paths = get_audio_paths(args.audio_dir)
# Chord recognition and save lab file
for i, audio_path in enumerate(audio_paths):
logger.info("======== %d of %d in progress ========" % (i + 1, len(audio_paths)))
# Load mp3
feature, feature_per_second, song_length_second = audio_file_to_features(audio_path, config)
logger.info("audio file loaded and feature computation success : %s" % audio_path)
# Majmin type chord recognition
feature = feature.T
feature = (feature - mean) / std
time_unit = feature_per_second
n_timestep = config.model['timestep']
num_pad = n_timestep - (feature.shape[0] % n_timestep)
feature = np.pad(feature, ((0, num_pad), (0, 0)), mode="constant", constant_values=0)
num_instance = feature.shape[0] // n_timestep
start_time = 0.0
lines = []
with torch.no_grad():
model.eval()
feature = torch.tensor(feature, dtype=torch.float32).unsqueeze(0).to(device)
for t in range(num_instance):
self_attn_output, _ = model.self_attn_layers(feature[:, n_timestep * t:n_timestep * (t + 1), :])
prediction, _ = model.output_layer(self_attn_output)
prediction = prediction.squeeze()
for j in range(n_timestep):  # frame index within this chunk (i indexes audio files above)
if t == 0 and j == 0:
prev_chord = prediction[j].item()
continue
if prediction[j].item() != prev_chord:
lines.append(
'%.3f %.3f %s\n' % (start_time, time_unit * (n_timestep * t + j), idx_to_chord[prev_chord]))
start_time = time_unit * (n_timestep * t + j)
prev_chord = prediction[j].item()
if t == num_instance - 1 and j + num_pad == n_timestep:
if start_time != time_unit * (n_timestep * t + j):
lines.append('%.3f %.3f %s\n' % (start_time, time_unit * (n_timestep * t + j), idx_to_chord[prev_chord]))
break
# lab file write
if not os.path.exists(args.save_dir):
os.makedirs(args.save_dir)
save_path = os.path.join(args.save_dir, os.path.split(audio_path)[-1].replace('.mp3', '').replace('.wav', '') + '.lab')
with open(save_path, 'w') as f:
for line in lines:
f.write(line)
logger.info("label file saved : %s" % save_path)
# lab file to midi file
starts, ends, pitches = list(), list(), list()
intervals, chords = mir_eval.io.load_labeled_intervals(save_path)
for p in range(12):
for k, (interval, chord) in enumerate(zip(intervals, chords)):
root_num, relative_bitmap, _ = mir_eval.chord.encode(chord)
tmp_label = mir_eval.chord.rotate_bitmap_to_root(relative_bitmap, root_num)[p]
if k == 0:
start_time = interval[0]
label = tmp_label
continue
if tmp_label != label:
if label == 1.0:
starts.append(start_time)
ends.append(interval[0])
pitches.append(p + 48)
start_time = interval[0]
label = tmp_label
if k == (len(intervals) - 1):
if label == 1.0:
starts.append(start_time)
ends.append(interval[1])
pitches.append(p + 48)
midi = pm.PrettyMIDI()
instrument = pm.Instrument(program=0)
for start, end, pitch in zip(starts, ends, pitches):
pm_note = pm.Note(velocity=120, pitch=pitch, start=start, end=end)
instrument.notes.append(pm_note)
midi.instruments.append(instrument)
midi.write(save_path.replace('.lab', '.midi'))
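The lab-to-MIDI pass above turns each chord label into a 12-dimensional pitch-class bitmap, rotates it to the absolute root, and emits MIDI notes in octave 3 (pitch class + 48). A minimal sketch of that bitmap logic, using a hypothetical `chord_pitch_classes` helper and a toy two-quality template table standing in for the much richer `mir_eval.chord.encode` / `rotate_bitmap_to_root` pair:

```python
import numpy as np

# Toy interval templates relative to the root (mir_eval covers far more qualities).
TEMPLATES = {
    'maj': [0, 4, 7],
    'min': [0, 3, 7],
}
ROOTS = {'C': 0, 'C#': 1, 'D': 2, 'D#': 3, 'E': 4, 'F': 5,
         'F#': 6, 'G': 7, 'G#': 8, 'A': 9, 'A#': 10, 'B': 11}

def chord_pitch_classes(label):
    """Return a 12-dim absolute pitch-class bitmap for a 'Root:quality' label."""
    root_str, quality = label.split(':')
    relative = np.zeros(12, dtype=int)
    relative[TEMPLATES[quality]] = 1
    # Rotating the root-relative bitmap by the root index yields absolute
    # pitch classes, which is what rotate_bitmap_to_root does in test.py.
    return np.roll(relative, ROOTS[root_str])

bitmap = chord_pitch_classes('A:min')               # A, C, E active
pitches = [p + 48 for p in np.flatnonzero(bitmap)]  # MIDI numbers, octave 3 as in test.py
```

Each active pitch class then becomes one `pm.Note` spanning the chord's interval, exactly as in the loop above.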
================================================
FILE: train.py
================================================
import os
from torch import optim
from utils import logger
from audio_dataset import AudioDataset, AudioDataLoader
from utils.tf_logger import TF_Logger
from btc_model import *
from baseline_models import CNN, CRNN
from utils.hparams import HParams
import argparse
from utils.pytorch_utils import adjusting_learning_rate
from utils.mir_eval_modules import root_majmin_score_calculation, large_voca_score_calculation
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
logger.logging_verbosity(1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
parser = argparse.ArgumentParser()
# argparse does not pass defaults through `type`, and type=bool treats any
# non-empty string (including 'False') as True, so use required/store_true instead
parser.add_argument('--index', type=int, help='Experiment Number', required=True)
parser.add_argument('--kfold', type=int, help='5 fold (0,1,2,3,4)', required=True)
parser.add_argument('--voca', action='store_true', help='use the large vocabulary (170 chords)')
parser.add_argument('--model', type=str, help='btc, cnn, crnn', default='btc')
parser.add_argument('--dataset1', type=str, help='Dataset', default='isophonic')
parser.add_argument('--dataset2', type=str, help='Dataset', default='uspop')
parser.add_argument('--dataset3', type=str, help='Dataset', default='robbiewilliams')
parser.add_argument('--restore_epoch', type=int, default=1000)
parser.add_argument('--early_stop', type=bool, help='no improvement during 10 epoch -> stop', default=True)
args = parser.parse_args()
config = HParams.load("run_config.yaml")
if args.voca:
config.feature['large_voca'] = True
config.model['num_chords'] = 170
# Result save path
asset_path = config.path['asset_path']
ckpt_path = config.path['ckpt_path']
result_path = config.path['result_path']
restore_epoch = args.restore_epoch
experiment_num = str(args.index)
ckpt_file_name = 'idx_'+experiment_num+'_%03d.pth.tar'
tf_logger = TF_Logger(os.path.join(asset_path, 'tensorboard', 'idx_'+experiment_num))
logger.info("==== Experiment Number : %d " % args.index)
if args.model == 'cnn':
config.experiment['batch_size'] = 10
# Data loader
train_dataset1 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset1,), num_workers=20, preprocessing=False, train=True, kfold=args.kfold)
train_dataset2 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset2,), num_workers=20, preprocessing=False, train=True, kfold=args.kfold)
train_dataset3 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset3,), num_workers=20, preprocessing=False, train=True, kfold=args.kfold)
train_dataset = train_dataset1 + train_dataset2 + train_dataset3
valid_dataset1 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset1,), preprocessing=False, train=False, kfold=args.kfold)
valid_dataset2 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset2,), preprocessing=False, train=False, kfold=args.kfold)
valid_dataset3 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset3,), preprocessing=False, train=False, kfold=args.kfold)
valid_dataset = valid_dataset1 + valid_dataset2 + valid_dataset3
train_dataloader = AudioDataLoader(dataset=train_dataset, batch_size=config.experiment['batch_size'], drop_last=False, shuffle=True)
valid_dataloader = AudioDataLoader(dataset=valid_dataset, batch_size=config.experiment['batch_size'], drop_last=False)
# Model and Optimizer
if args.model == 'cnn':
model = CNN(config=config.model).to(device)
elif args.model == 'crnn':
model = CRNN(config=config.model).to(device)
elif args.model == 'btc':
model = BTC_model(config=config.model).to(device)
else: raise NotImplementedError
optimizer = optim.Adam(model.parameters(), lr=config.experiment['learning_rate'], weight_decay=config.experiment['weight_decay'], betas=(0.9, 0.98), eps=1e-9)
# Make asset directory
os.makedirs(os.path.join(asset_path, ckpt_path), exist_ok=True)
os.makedirs(os.path.join(asset_path, result_path), exist_ok=True)
# Load model
if os.path.isfile(os.path.join(asset_path, ckpt_path, ckpt_file_name % restore_epoch)):
checkpoint = torch.load(os.path.join(asset_path, ckpt_path, ckpt_file_name % restore_epoch))
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']
logger.info("restore model with %d epochs" % restore_epoch)
else:
logger.info("no checkpoint with %d epochs" % restore_epoch)
restore_epoch = 0
# Global mean and variance calculate
mp3_config = config.mp3
feature_config = config.feature
mp3_string = "%d_%.1f_%.1f" % (mp3_config['song_hz'], mp3_config['inst_len'], mp3_config['skip_interval'])
feature_string = "_%s_%d_%d_%d_" % ('cqt', feature_config['n_bins'], feature_config['bins_per_octave'], feature_config['hop_length'])
z_path = os.path.join(config.path['root_path'], 'result', mp3_string + feature_string + 'mix_kfold_'+ str(args.kfold) +'_normalization.pt')
if os.path.exists(z_path):
normalization = torch.load(z_path)
mean = normalization['mean']
std = normalization['std']
logger.info("Global mean and std (k fold index %d) load complete" % args.kfold)
else:
mean = 0
square_mean = 0
k = 0
for i, data in enumerate(train_dataloader):
features, input_percentages, chords, collapsed_chords, chord_lens, boundaries = data
features = features.to(device)
mean += torch.mean(features).item()
square_mean += torch.mean(features.pow(2)).item()
k += 1
square_mean = square_mean / k
mean = mean / k
std = np.sqrt(square_mean - mean * mean)
normalization = dict()
normalization['mean'] = mean
normalization['std'] = std
torch.save(normalization, z_path)
logger.info("Global mean and std (training set, k fold index %d) calculation complete" % args.kfold)
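The branch above estimates one global mean and std for CQT normalization by accumulating per-batch E[x] and E[x²], then using std = sqrt(E[x²] − E[x]²). A self-contained sketch of the same accumulation (the `global_mean_std` name is illustrative; note that averaging per-batch means matches the exact global statistics only when all batches have equal size):

```python
import numpy as np

def global_mean_std(batches):
    """Accumulate E[x] and E[x^2] over batches, as the training script does."""
    mean, square_mean, k = 0.0, 0.0, 0
    for b in batches:
        mean += np.mean(b)
        square_mean += np.mean(np.square(b))
        k += 1
    mean /= k
    square_mean /= k
    # Var(x) = E[x^2] - E[x]^2
    return mean, np.sqrt(square_mean - mean * mean)

rng = np.random.default_rng(0)
# equally sized dummy feature batches drawn from N(3, 2^2)
batches = [rng.normal(3.0, 2.0, size=(4, 144, 108)) for _ in range(8)]
mu, sigma = global_mean_std(batches)
```

The resulting pair is cached to `z_path` so later runs (and test.py, via the shipped checkpoints) can reuse it.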
current_step = 0
best_acc = 0
before_acc = 0
last_best_epoch = restore_epoch
early_stop_idx = 0
for epoch in range(restore_epoch, config.experiment['max_epoch']):
# Training
model.train()
train_loss_list = []
total = 0.
correct = 0.
second_correct = 0.
for i, data in enumerate(train_dataloader):
features, input_percentages, chords, collapsed_chords, chord_lens, boundaries = data
features, chords = features.to(device), chords.to(device)
features.requires_grad = True
features = (features - mean) / std
# forward
features = features.squeeze(1).permute(0,2,1)
optimizer.zero_grad()
prediction, total_loss, weights, second = model(features, chords)
# save accuracy and loss
total += chords.size(0)
correct += (prediction == chords).type_as(chords).sum()
second_correct += (second == chords).type_as(chords).sum()
train_loss_list.append(total_loss.item())
# optimize step
total_loss.backward()
optimizer.step()
current_step += 1
# logging loss and accuracy using tensorboard
result = {'loss/tr': np.mean(train_loss_list), 'acc/tr': correct.item() / total, 'top2/tr': (correct.item()+second_correct.item()) / total}
for tag, value in result.items(): tf_logger.scalar_summary(tag, value, epoch+1)
logger.info("training loss for %d epoch: %.4f" % (epoch + 1, np.mean(train_loss_list)))
logger.info("training accuracy for %d epoch: %.4f" % (epoch + 1, (correct.item() / total)))
logger.info("training top2 accuracy for %d epoch: %.4f" % (epoch + 1, ((correct.item() + second_correct.item()) / total)))
# Validation
with torch.no_grad():
model.eval()
val_total = 0.
val_correct = 0.
val_second_correct = 0.
validation_loss = 0
n = 0
for i, data in enumerate(valid_dataloader):
val_features, val_input_percentages, val_chords, val_collapsed_chords, val_chord_lens, val_boundaries = data
val_features, val_chords = val_features.to(device), val_chords.to(device)
val_features = (val_features - mean) / std
val_features = val_features.squeeze(1).permute(0, 2, 1)
val_prediction, val_loss, weights, val_second = model(val_features, val_chords)
val_total += val_chords.size(0)
val_correct += (val_prediction == val_chords).type_as(val_chords).sum()
val_second_correct += (val_second == val_chords).type_as(val_chords).sum()
validation_loss += val_loss.item()
n += 1
# logging loss and accuracy using tensorboard
validation_loss /= n
result = {'loss/val': validation_loss, 'acc/val': val_correct.item() / val_total, 'top2/val': (val_correct.item()+val_second_correct.item()) / val_total}
for tag, value in result.items(): tf_logger.scalar_summary(tag, value, epoch + 1)
logger.info("validation loss(%d): %.4f" % (epoch + 1, validation_loss))
logger.info("validation accuracy(%d): %.4f" % (epoch + 1, (val_correct.item() / val_total)))
logger.info("validation top2 accuracy(%d): %.4f" % (epoch + 1, ((val_correct.item() + val_second_correct.item()) / val_total)))
current_acc = val_correct.item() / val_total
if best_acc < val_correct.item() / val_total:
early_stop_idx = 0
best_acc = val_correct.item() / val_total
logger.info('==== best accuracy is %.4f and epoch is %d' % (best_acc, epoch + 1))
logger.info('saving model, Epoch %d, step %d' % (epoch + 1, current_step + 1))
model_save_path = os.path.join(asset_path, ckpt_path, ckpt_file_name % (epoch + 1))
state_dict = {'model': model.state_dict(),'optimizer': optimizer.state_dict(),'epoch': epoch}
torch.save(state_dict, model_save_path)
last_best_epoch = epoch + 1
# save model
elif (epoch + 1) % config.experiment['save_step'] == 0:
logger.info('saving model, Epoch %d, step %d' % (epoch + 1, current_step + 1))
model_save_path = os.path.join(asset_path, ckpt_path, ckpt_file_name % (epoch + 1))
state_dict = {'model': model.state_dict(),'optimizer': optimizer.state_dict(),'epoch': epoch}
torch.save(state_dict, model_save_path)
early_stop_idx += 1
else:
early_stop_idx += 1
if args.early_stop and early_stop_idx > 9:
logger.info('==== early stopped and epoch is %d' % (epoch + 1))
break
# learning rate decay
if before_acc > current_acc:
adjusting_learning_rate(optimizer=optimizer, factor=0.95, min_lr=5e-6)
before_acc = current_acc
# Load model
if os.path.isfile(os.path.join(asset_path, ckpt_path, ckpt_file_name % last_best_epoch)):
checkpoint = torch.load(os.path.join(asset_path, ckpt_path, ckpt_file_name % last_best_epoch))
model.load_state_dict(checkpoint['model'])
logger.info("restore model with %d epochs" % last_best_epoch)
else:
raise FileNotFoundError(os.path.join(asset_path, ckpt_path, ckpt_file_name % last_best_epoch))
# score Validation
if args.voca:
score_metrics = ['root', 'thirds', 'triads', 'sevenths', 'tetrads', 'majmin', 'mirex']
score_list_dict1, song_length_list1, average_score_dict1 = large_voca_score_calculation(valid_dataset=valid_dataset1, config=config, model=model, model_type=args.model, mean=mean, std=std, device=device)
score_list_dict2, song_length_list2, average_score_dict2 = large_voca_score_calculation(valid_dataset=valid_dataset2, config=config, model=model, model_type=args.model, mean=mean, std=std, device=device)
score_list_dict3, song_length_list3, average_score_dict3 = large_voca_score_calculation(valid_dataset=valid_dataset3, config=config, model=model, model_type=args.model, mean=mean, std=std, device=device)
for m in score_metrics:
average_score = (np.sum(song_length_list1) * average_score_dict1[m] + np.sum(song_length_list2) *average_score_dict2[m] + np.sum(song_length_list3) * average_score_dict3[m]) / (np.sum(song_length_list1) + np.sum(song_length_list2) + np.sum(song_length_list3))
logger.info('==== %s score 1 is %.4f' % (m, average_score_dict1[m]))
logger.info('==== %s score 2 is %.4f' % (m, average_score_dict2[m]))
logger.info('==== %s score 3 is %.4f' % (m, average_score_dict3[m]))
logger.info('==== %s mix average score is %.4f' % (m, average_score))
else:
score_metrics = ['root', 'majmin']
score_list_dict1, song_length_list1, average_score_dict1 = root_majmin_score_calculation(valid_dataset=valid_dataset1, config=config, model=model, model_type=args.model, mean=mean, std=std, device=device)
score_list_dict2, song_length_list2, average_score_dict2 = root_majmin_score_calculation(valid_dataset=valid_dataset2, config=config, model=model, model_type=args.model, mean=mean, std=std, device=device)
score_list_dict3, song_length_list3, average_score_dict3 = root_majmin_score_calculation(valid_dataset=valid_dataset3, config=config, model=model, model_type=args.model, mean=mean, std=std, device=device)
for m in score_metrics:
average_score = (np.sum(song_length_list1) * average_score_dict1[m] + np.sum(song_length_list2) *average_score_dict2[m] + np.sum(song_length_list3) * average_score_dict3[m]) / (np.sum(song_length_list1) + np.sum(song_length_list2) + np.sum(song_length_list3))
logger.info('==== %s score 1 is %.4f' % (m, average_score_dict1[m]))
logger.info('==== %s score 2 is %.4f' % (m, average_score_dict2[m]))
logger.info('==== %s score 3 is %.4f' % (m, average_score_dict3[m]))
logger.info('==== %s mix average score is %.4f' % (m, average_score))
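The "mix average" reported above is a duration-weighted mean of the three per-dataset scores: each dataset's score is weighted by its total song length before averaging. A small sketch of that reduction (the `mix_average` name and the numbers are illustrative, not from the repo):

```python
import numpy as np

def mix_average(scores, song_lengths):
    """Duration-weighted average: sum(len_i * score_i) / sum(len_i)."""
    scores = np.asarray(scores, dtype=float)
    lengths = np.asarray(song_lengths, dtype=float)
    return float(np.sum(lengths * scores) / np.sum(lengths))

# e.g. three datasets with hypothetical total lengths (seconds) and majmin scores
avg = mix_average([0.80, 0.70, 0.90], [1000.0, 500.0, 500.0])
```

This matches the inline expression in the loop above, just factored out for one metric at a time.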
================================================
FILE: train_crf.py
================================================
import os
from torch import optim
from utils import logger
from audio_dataset import AudioDataset, AudioDataLoader
from utils.tf_logger import TF_Logger
from btc_model import *
from baseline_models import CNN, CRNN, Crf
from utils.hparams import HParams
import argparse
from utils.pytorch_utils import adjusting_learning_rate
from utils.mir_eval_modules import large_voca_score_calculation_crf, root_majmin_score_calculation_crf
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
logger.logging_verbosity(1)
use_cuda = torch.cuda.is_available()
device = torch.device("cuda" if use_cuda else "cpu")
parser = argparse.ArgumentParser()
# argparse does not pass defaults through `type`, and type=bool treats any
# non-empty string (including 'False') as True, so use required/store_true instead
parser.add_argument('--index', type=int, help='Experiment Number', required=True)
parser.add_argument('--kfold', type=int, help='5 fold (0,1,2,3,4)', required=True)
parser.add_argument('--voca', action='store_true', help='use the large vocabulary (170 chords)')
parser.add_argument('--model', type=str, default='crf')
parser.add_argument('--pre_model', type=str, help='btc, cnn, crnn', required=True)
parser.add_argument('--dataset1', type=str, help='Dataset', default='isophonic_221')
parser.add_argument('--dataset2', type=str, help='Dataset', default='uspop_185')
parser.add_argument('--dataset3', type=str, help='Dataset', default='robbiewilliams')
parser.add_argument('--restore_epoch', type=int, default=1000)
parser.add_argument('--early_stop', type=bool, help='no improvement during 10 epoch -> stop', default=True)
args = parser.parse_args()
config = HParams.load("run_config.yaml")
if args.voca:
config.feature['large_voca'] = True
config.model['num_chords'] = 170
config.model['probs_out'] = True
# Result save path
asset_path = config.path['asset_path']
ckpt_path = config.path['ckpt_path']
result_path = config.path['result_path']
restore_epoch = args.restore_epoch
experiment_num = str(args.index)
ckpt_file_name = 'idx_'+experiment_num+'_%03d.pth.tar'
tf_logger = TF_Logger(os.path.join(asset_path, 'tensorboard', 'idx_'+experiment_num))
logger.info("==== Experiment Number : %d " % args.index)
if args.pre_model == 'cnn':
config.experiment['batch_size'] = 20
# Data loader
train_dataset1 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset1,), num_workers=20, preprocessing=False, train=True, kfold=args.kfold)
train_dataset2 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset2,), num_workers=20, preprocessing=False, train=True, kfold=args.kfold)
train_dataset3 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset3,), num_workers=20, preprocessing=False, train=True, kfold=args.kfold)
train_dataset = train_dataset1 + train_dataset2 + train_dataset3
valid_dataset1 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset1,), preprocessing=False, train=False, kfold=args.kfold)
valid_dataset2 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset2,), preprocessing=False, train=False, kfold=args.kfold)
valid_dataset3 = AudioDataset(config, root_dir=config.path['root_path'], dataset_names=(args.dataset3,), preprocessing=False, train=False, kfold=args.kfold)
valid_dataset = valid_dataset1 + valid_dataset2 + valid_dataset3
train_dataloader = AudioDataLoader(dataset=train_dataset, batch_size=config.experiment['batch_size'], drop_last=False, shuffle=True)
valid_dataloader = AudioDataLoader(dataset=valid_dataset, batch_size=config.experiment['batch_size'], drop_last=False)
# Model and Optimizer
if args.pre_model == 'cnn':
pre_model = CNN(config=config.model).to(device)
elif args.pre_model == 'crnn':
pre_model = CRNN(config=config.model).to(device)
elif args.pre_model == 'btc':
pre_model = BTC_model(config=config.model).to(device)
else: raise NotImplementedError
if args.pre_model == 'cnn':
if not args.voca:
if args.kfold == 0:
load_ckpt_file_name = 'idx_0_%03d.pth.tar'
load_restore_epoch = 10
else:
raise NotImplementedError
else:
if args.kfold == 0:
load_ckpt_file_name = 'idx_1_%03d.pth.tar'
load_restore_epoch = 10
else:
raise NotImplementedError
else:
raise NotImplementedError
if os.path.isfile(os.path.join(asset_path, ckpt_path, load_ckpt_file_name % load_restore_epoch)):
checkpoint = torch.load(os.path.join(asset_path, ckpt_path, load_ckpt_file_name % load_restore_epoch))
pre_model.load_state_dict(checkpoint['model'])
logger.info("restore pre model with %d epochs" % load_restore_epoch)
else:
raise NotImplementedError
# Fix Pre Model Parameters
for param in pre_model.parameters():
param.requires_grad = False
# Crf Model and Optimizer
crf = Crf(num_chords=config.model['num_chords'], timestep=config.model['timestep']).to(device)
optimizer = optim.Adam(filter(lambda p: p.requires_grad, crf.parameters()), lr=0.01, weight_decay=config.experiment['weight_decay'], betas=(0.9, 0.98), eps=1e-9)
# Make asset directory
os.makedirs(os.path.join(asset_path, ckpt_path), exist_ok=True)
os.makedirs(os.path.join(asset_path, result_path), exist_ok=True)
# Load model
if os.path.isfile(os.path.join(asset_path, ckpt_path, ckpt_file_name % restore_epoch)):
checkpoint = torch.load(os.path.join(asset_path, ckpt_path, ckpt_file_name % restore_epoch))
crf.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
epoch = checkpoint['epoch']
logger.info("restore model with %d epochs" % restore_epoch)
else:
logger.info("no checkpoint with %d epochs" % restore_epoch)
restore_epoch = 0
# Global mean and variance calculate
mp3_config = config.mp3
feature_config = config.feature
mp3_string = "%d_%.1f_%.1f" % (mp3_config['song_hz'], mp3_config['inst_len'], mp3_config['skip_interval'])
feature_string = "_%s_%d_%d_%d_" % ('cqt', feature_config['n_bins'], feature_config['bins_per_octave'], feature_config['hop_length'])
z_path = os.path.join(config.path['root_path'], 'result', mp3_string + feature_string + 'mix_kfold_'+ str(args.kfold) +'_normalization.pt')
if os.path.exists(z_path):
normalization = torch.load(z_path)
mean = normalization['mean']
std = normalization['std']
logger.info("Global mean and std (k fold index %d) load complete" % args.kfold)
else:
mean = 0
square_mean = 0
k = 0
for i, data in enumerate(train_dataloader):
features, input_percentages, chords, collapsed_chords, chord_lens, boundaries = data
features = features.to(device)
mean += torch.mean(features).item()
square_mean += torch.mean(features.pow(2)).item()
k += 1
square_mean = square_mean / k
mean = mean / k
std = np.sqrt(square_mean - mean * mean)
normalization = dict()
normalization['mean'] = mean
normalization['std'] = std
torch.save(normalization, z_path)
logger.info("Global mean and std (training set, k fold index %d) calculation complete" % args.kfold)
current_step = 0
best_acc = 0
before_acc = 0
last_best_epoch = restore_epoch
early_stop_idx = 0
pre_model.eval()
for epoch in range(restore_epoch, config.experiment['max_epoch']):
# Training
crf.train()
train_loss_list = []
total = 0.
correct = 0.
second_correct = 0.
for i, data in enumerate(train_dataloader):
features, input_percentages, chords, collapsed_chords, chord_lens, boundaries = data
features, chords = features.to(device), chords.to(device)
features.requires_grad = True
features = (features - mean) / std
# forward
features = features.squeeze(1).permute(0,2,1)
optimizer.zero_grad()
logits = pre_model(features, chords)
if args.pre_model == 'crnn':
logits = logits.detach()
logits.requires_grad = True
prediction, total_loss = crf(logits, chords)
# save accuracy and loss
total += chords.size(0)
correct += (prediction == chords).type_as(chords).sum()
train_loss_list.append(total_loss.item())
# optimize step
total_loss.backward()
optimizer.step()
current_step += 1
# logging loss and accuracy using tensorboard
result = {'loss/tr': np.mean(train_loss_list), 'acc/tr': correct.item() / total}
for tag, value in result.items(): tf_logger.scalar_summary(tag, value, epoch+1)
logger.info("training loss for %d epoch: %.4f" % (epoch + 1, np.mean(train_loss_list)))
logger.info("training accuracy for %d epoch: %.4f" % (epoch + 1, (correct.item() / total)))
# Validation
with torch.no_grad():
crf.eval()
val_total = 0.
val_correct = 0.
val_second_correct = 0.
validation_loss = 0
n = 0
for i, data in enumerate(valid_dataloader):
val_features, val_input_percentages, val_chords, val_collapsed_chords, val_chord_lens, val_boundaries = data
val_features, val_chords = val_features.to(device), val_chords.to(device)
val_features = (val_features - mean) / std
val_features = val_features.squeeze(1).permute(0, 2, 1)
val_logits = pre_model(val_features, val_chords)
val_prediction, val_loss = crf(val_logits, val_chords)
val_total += val_chords.size(0)
val_correct += (val_prediction == val_chords).type_as(val_chords).sum()
validation_loss += val_loss.item()
n += 1
# logging loss and accuracy using tensorboard
validation_loss /= n
result = {'loss/val': validation_loss, 'acc/val': val_correct.item() / val_total}
for tag, value in result.items(): tf_logger.scalar_summary(tag, value, epoch + 1)
logger.info("validation loss(%d): %.4f" % (epoch + 1, validation_loss))
logger.info("validation accuracy(%d): %.4f" % (epoch + 1, (val_correct.item() / val_total)))
current_acc = val_correct.item() / val_total
if best_acc < val_correct.item() / val_total:
early_stop_idx = 0
best_acc = val_correct.item() / val_total
logger.info('==== best accuracy is %.4f and epoch is %d' % (best_acc, epoch + 1))
logger.info('saving model, Epoch %d, step %d' % (epoch + 1, current_step + 1))
model_save_path = os.path.join(asset_path, ckpt_path, ckpt_file_name % (epoch + 1))
state_dict = {'model': crf.state_dict(),'optimizer': optimizer.state_dict(),'epoch': epoch}
torch.save(state_dict, model_save_path)
last_best_epoch = epoch + 1
# save model
elif (epoch + 1) % config.experiment['save_step'] == 0:
logger.info('saving model, Epoch %d, step %d' % (epoch + 1, current_step + 1))
model_save_path = os.path.join(asset_path, ckpt_path, ckpt_file_name % (epoch + 1))
state_dict = {'model': crf.state_dict(),'optimizer': optimizer.state_dict(),'epoch': epoch}
torch.save(state_dict, model_save_path)
early_stop_idx += 1
else:
early_stop_idx += 1
if args.early_stop and early_stop_idx > 5:
logger.info('==== early stopped and epoch is %d' % (epoch + 1))
break
# learning rate decay
if before_acc > current_acc:
adjusting_learning_rate(optimizer=optimizer, factor=0.95, min_lr=5e-6)
before_acc = current_acc
# Load model
if os.path.isfile(os.path.join(asset_path, ckpt_path, ckpt_file_name % last_best_epoch)):
checkpoint = torch.load(os.path.join(asset_path, ckpt_path, ckpt_file_name % last_best_epoch))
crf.load_state_dict(checkpoint['model'])
logger.info("last best restore model with %d epochs" % last_best_epoch)
else:
raise FileNotFoundError(os.path.join(asset_path, ckpt_path, ckpt_file_name % last_best_epoch))
# score Validation
if args.voca:
score_metrics = ['root', 'thirds', 'triads', 'sevenths', 'tetrads', 'majmin', 'mirex']
score_list_dict1, song_length_list1, average_score_dict1 = large_voca_score_calculation_crf(valid_dataset=valid_dataset1, config=config, pre_model=pre_model, model=crf, model_type=args.pre_model, mean=mean, std=std, device=device)
score_list_dict2, song_length_list2, average_score_dict2 = large_voca_score_calculation_crf(valid_dataset=valid_dataset2, config=config, pre_model=pre_model, model=crf, model_type=args.pre_model, mean=mean, std=std, device=device)
score_list_dict3, song_length_list3, average_score_dict3 = large_voca_score_calculation_crf(valid_dataset=valid_dataset3, config=config, pre_model=pre_model, model=crf, model_type=args.pre_model, mean=mean, std=std, device=device)
for m in score_metrics:
average_score = (np.sum(song_length_list1) * average_score_dict1[m] + np.sum(song_length_list2) *average_score_dict2[m] + np.sum(song_length_list3) * average_score_dict3[m]) / (np.sum(song_length_list1) + np.sum(song_length_list2) + np.sum(song_length_list3))
logger.info('==== %s score 1 is %.4f' % (m, average_score_dict1[m]))
logger.info('==== %s score 2 is %.4f' % (m, average_score_dict2[m]))
logger.info('==== %s score 3 is %.4f' % (m, average_score_dict3[m]))
logger.info('==== %s mix average score is %.4f' % (m, average_score))
else:
score_metrics = ['root', 'majmin']
score_list_dict1, song_length_list1, average_score_dict1 = root_majmin_score_calculation_crf(valid_dataset=valid_dataset1, config=config, pre_model=pre_model, model=crf, model_type=args.pre_model, mean=mean, std=std, device=device)
score_list_dict2, song_length_list2, average_score_dict2 = root_majmin_score_calculation_crf(valid_dataset=valid_dataset2, config=config, pre_model=pre_model, model=crf, model_type=args.pre_model, mean=mean, std=std, device=device)
score_list_dict3, song_length_list3, average_score_dict3 = root_majmin_score_calculation_crf(valid_dataset=valid_dataset3, config=config, pre_model=pre_model, model=crf, model_type=args.pre_model, mean=mean, std=std, device=device)
for m in score_metrics:
average_score = (np.sum(song_length_list1) * average_score_dict1[m] + np.sum(song_length_list2) *average_score_dict2[m] + np.sum(song_length_list3) * average_score_dict3[m]) / (np.sum(song_length_list1) + np.sum(song_length_list2) + np.sum(song_length_list3))
logger.info('==== %s score 1 is %.4f' % (m, average_score_dict1[m]))
logger.info('==== %s score 2 is %.4f' % (m, average_score_dict2[m]))
logger.info('==== %s score 3 is %.4f' % (m, average_score_dict3[m]))
logger.info('==== %s mix average score is %.4f' % (m, average_score))
================================================
FILE: utils/__init__.py
================================================
================================================
FILE: utils/chords.py
================================================
# encoding: utf-8
"""
This module contains chord evaluation functionality.
It provides the evaluation measures used for the MIREX ACE task, and
tries to follow [1]_ and [2]_ as closely as possible.
Notes
-----
This implementation tries to follow the references and their implementation
(e.g., https://github.com/jpauwels/MusOOEvaluator for [2]_). However, there
are some known (and possibly some unknown) differences. If you find one not
listed in the following, please file an issue:
- Detected chord segments are adjusted to fit the length of the annotations.
In particular, this means that, if necessary, filler segments of 'no chord'
are added at beginnings and ends. This can result in different segmentation
scores compared to the original implementation.
References
----------
.. [1] Christopher Harte, "Towards Automatic Extraction of Harmony Information
from Music Signals." Dissertation,
Department for Electronic Engineering, Queen Mary University of London,
2010.
.. [2] Johan Pauwels and Geoffroy Peeters.
"Evaluating Automatically Estimated Chord Sequences."
In Proceedings of ICASSP 2013, Vancouver, Canada, 2013.
"""
import numpy as np
import pandas as pd
import mir_eval
# NumPy removed the np.int / np.float / np.bool aliases (NumPy >= 1.24); use builtins.
CHORD_DTYPE = [('root', int),
('bass', int),
('intervals', int, (12,)),
('is_major', bool)]
CHORD_ANN_DTYPE = [('start', float),
('end', float),
('chord', CHORD_DTYPE)]
NO_CHORD = (-1, -1, np.zeros(12, dtype=int), False)
UNKNOWN_CHORD = (-1, -1, np.ones(12, dtype=int) * -1, False)
PITCH_CLASS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
def idx_to_chord(idx):
if idx == 24:
return "-"
elif idx == 25:
return u"\u03B5"
minmaj = idx % 2
root = idx // 2
return PITCH_CLASS[root] + ("M" if minmaj == 0 else "m")
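In the 25-class major/minor vocabulary, `idx_to_chord` packs two chords per root: even indices are major, odd indices minor (idx = 2*root + 0/1), with 24 reserved for "no chord" and 25 for the unknown-chord symbol ε. The sketch below re-implements the same mapping for illustration:

```python
PITCH_CLASS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def idx_to_chord(idx):
    # Mirrors the function above: even indices major, odd minor,
    # 24 is 'no chord', 25 the unknown-chord symbol.
    if idx == 24:
        return "-"
    if idx == 25:
        return u"\u03B5"
    return PITCH_CLASS[idx // 2] + ("M" if idx % 2 == 0 else "m")

labels = [idx_to_chord(i) for i in range(26)]
```

So, for example, index 13 decodes as 13 // 2 = 6 ('F#') with the odd bit set, i.e. F# minor.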
class Chords:
def __init__(self):
self._shorthands = {
'maj': self.interval_list('(1,3,5)'),
'min': self.interval_list('(1,b3,5)'),
'dim': self.interval_list('(1,b3,b5)'),
'aug': self.interval_list('(1,3,#5)'),
'maj7': self.interval_list('(1,3,5,7)'),
'min7': self.interval_list('(1,b3,5,b7)'),
'7': self.interval_list('(1,3,5,b7)'),
'6': self.interval_list('(1,6)'), # custom
'5': self.interval_list('(1,5)'),
'4': self.interval_list('(1,4)'), # custom
'1': self.interval_list('(1)'),
'dim7': self.interval_list('(1,b3,b5,bb7)'),
'hdim7': self.interval_list('(1,b3,b5,b7)'),
'minmaj7': self.interval_list('(1,b3,5,7)'),
'maj6': self.interval_list('(1,3,5,6)'),
'min6': self.interval_list('(1,b3,5,6)'),
'9': self.interval_list('(1,3,5,b7,9)'),
'maj9': self.interval_list('(1,3,5,7,9)'),
'min9': self.interval_list('(1,b3,5,b7,9)'),
'sus2': self.interval_list('(1,2,5)'),
'sus4': self.interval_list('(1,4,5)'),
'11': self.interval_list('(1,3,5,b7,9,11)'),
'min11': self.interval_list('(1,b3,5,b7,9,11)'),
'13': self.interval_list('(1,3,5,b7,13)'),
'maj13': self.interval_list('(1,3,5,7,13)'),
'min13': self.interval_list('(1,b3,5,b7,13)')
}
def chords(self, labels):
"""
Transform a list of chord labels into an array of internal numeric
representations.
Parameters
----------
labels : list
List of chord labels (str).
Returns
-------
chords : numpy.array
Structured array with columns 'root', 'bass', and 'intervals',
containing a numeric representation of chords.
"""
crds = np.zeros(len(labels), dtype=CHORD_DTYPE)
cache = {}
for i, lbl in enumerate(labels):
cv = cache.get(lbl, None)
if cv is None:
cv = self.chord(lbl)
cache[lbl] = cv
crds[i] = cv
return crds
def label_error_modify(self, label):
if label == 'Emin/4': label = 'E:min/4'
elif label == 'A7/3': label = 'A:7/3'
elif label == 'Bb7/3': label = 'Bb:7/3'
elif label == 'Bb7/5': label = 'Bb:7/5'
elif label.find(':') == -1:
if label.find('min') != -1:
label = label[:label.find('min')] + ':' + label[label.find('min'):]
return label
def chord(self, label):
"""
Transform a chord label into the internal numeric representation of
(root, bass, intervals array).
Parameters
----------
label : str
Chord label.
Returns
-------
chord : tuple
Numeric representation of the chord: (root, bass, intervals array).
"""
try:
is_major = False
if label == 'N':
return NO_CHORD
if label == 'X':
return UNKNOWN_CHORD
label = self.label_error_modify(label)
c_idx = label.find(':')
s_idx = label.find('/')
if c_idx == -1:
quality_str = 'maj'
if s_idx == -1:
root_str = label
bass_str = ''
else:
root_str = label[:s_idx]
bass_str = label[s_idx + 1:]
else:
root_str = label[:c_idx]
if s_idx == -1:
quality_str = label[c_idx + 1:]
bass_str = ''
else:
quality_str = label[c_idx + 1:s_idx]
bass_str = label[s_idx + 1:]
root = self.pitch(root_str)
bass = self.interval(bass_str) if bass_str else 0
ivs = self.chord_intervals(quality_str)
ivs[bass] = 1
            is_major = 'min' not in quality_str
        except Exception as e:
            # Malformed label: report it and fall back to the unknown chord
            # instead of hitting a NameError on the unbound locals below.
            print(e, label)
            return UNKNOWN_CHORD
        return root, bass, ivs, is_major
    # _chroma_id maps 1-based diatonic scale degrees to semitone offsets;
    # _l marks the whole steps of the major scale.
    _l = [0, 1, 1, 0, 1, 1, 1]
    _chroma_id = (np.arange(len(_l) * 2) + 1) + np.array(_l + _l).cumsum() - 1
def modify(self, base_pitch, modifier):
"""
Modify a pitch class in integer representation by a given modifier string.
A modifier string can be any sequence of 'b' (one semitone down)
and '#' (one semitone up).
Parameters
----------
base_pitch : int
Pitch class as integer.
modifier : str
String of modifiers ('b' or '#').
Returns
-------
modified_pitch : int
Modified root note.
"""
for m in modifier:
if m == 'b':
base_pitch -= 1
elif m == '#':
base_pitch += 1
else:
raise ValueError('Unknown modifier: {}'.format(m))
return base_pitch
def pitch(self, pitch_str):
"""
Convert a string representation of a pitch class (consisting of root
note and modifiers) to an integer representation.
Parameters
----------
pitch_str : str
String representation of a pitch class.
Returns
-------
pitch : int
Integer representation of a pitch class.
"""
return self.modify(self._chroma_id[(ord(pitch_str[0]) - ord('C')) % 7],
pitch_str[1:]) % 12
def interval(self, interval_str):
"""
Convert a string representation of a musical interval into a pitch class
(e.g. a minor seventh 'b7' into 10, because it is 10 semitones above its
base note).
Parameters
----------
interval_str : str
Musical interval.
Returns
-------
pitch_class : int
Number of semitones to base note of interval.
"""
for i, c in enumerate(interval_str):
if c.isdigit():
return self.modify(self._chroma_id[int(interval_str[i:]) - 1],
interval_str[:i]) % 12
def interval_list(self, intervals_str, given_pitch_classes=None):
"""
Convert a list of intervals given as string to a binary pitch class
representation. For example, 'b3, 5' would become
[0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0].
Parameters
----------
intervals_str : str
List of intervals as comma-separated string (e.g. 'b3, 5').
given_pitch_classes : None or numpy array
If None, start with empty pitch class array, if numpy array of length
12, this array will be modified.
Returns
-------
pitch_classes : numpy array
Binary pitch class representation of intervals.
"""
if given_pitch_classes is None:
            given_pitch_classes = np.zeros(12, dtype=int)  # np.int was removed in NumPy 1.24
for int_def in intervals_str[1:-1].split(','):
int_def = int_def.strip()
if int_def[0] == '*':
given_pitch_classes[self.interval(int_def[1:])] = 0
else:
given_pitch_classes[self.interval(int_def)] = 1
return given_pitch_classes
# mapping of shorthand interval notations to the actual interval representation
def chord_intervals(self, quality_str):
"""
Convert a chord quality string to a pitch class representation. For
example, 'maj' becomes [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0].
Parameters
----------
quality_str : str
String defining the chord quality.
Returns
-------
pitch_classes : numpy array
Binary pitch class representation of chord quality.
"""
list_idx = quality_str.find('(')
if list_idx == -1:
return self._shorthands[quality_str].copy()
if list_idx != 0:
ivs = self._shorthands[quality_str[:list_idx]].copy()
else:
            ivs = np.zeros(12, dtype=int)  # np.int was removed in NumPy 1.24
return self.interval_list(quality_str[list_idx:], ivs)
def load_chords(self, filename):
"""
Load chords from a text file.
        The chord labels must follow the syntax defined in [1]_.
Parameters
----------
filename : str
File containing chord segments.
Returns
-------
crds : numpy structured array
Structured array with columns "start", "end", and "chord",
containing the beginning, end, and chord definition of chord
segments.
References
----------
.. [1] Christopher Harte, "Towards Automatic Extraction of Harmony
Information from Music Signals." Dissertation,
Department for Electronic Engineering, Queen Mary University of
London, 2010.
"""
start, end, chord_labels = [], [], []
with open(filename, 'r') as f:
for line in f:
                splits = line.split()
                if len(splits) == 3:
                    start.append(float(splits[0]))
                    end.append(float(splits[1]))
                    chord_labels.append(splits[2])
crds = np.zeros(len(start), dtype=CHORD_ANN_DTYPE)
crds['start'] = start
crds['end'] = end
crds['chord'] = self.chords(chord_labels)
return crds
def reduce_to_triads(self, chords, keep_bass=False):
"""
Reduce chords to triads.
        The function follows the reduction rules implemented in [1]_. If a
        chord does not contain a third, major second or fourth, it is reduced
        to a power chord. If it contains neither a third nor a fifth, it is
        reduced to a single-note "chord".
Parameters
----------
chords : numpy structured array
Chords to be reduced.
keep_bass : bool
Indicates whether to keep the bass note or set it to 0.
Returns
-------
reduced_chords : numpy structured array
Chords reduced to triads.
References
----------
.. [1] Johan Pauwels and Geoffroy Peeters.
"Evaluating Automatically Estimated Chord Sequences."
In Proceedings of ICASSP 2013, Vancouver, Canada, 2013.
"""
unison = chords['intervals'][:, 0].astype(bool)
maj_sec = chords['intervals'][:, 2].astype(bool)
min_third = chords['intervals'][:, 3].astype(bool)
maj_third = chords['intervals'][:, 4].astype(bool)
perf_fourth = chords['intervals'][:, 5].astype(bool)
dim_fifth = chords['intervals'][:, 6].astype(bool)
perf_fifth = chords['intervals'][:, 7].astype(bool)
aug_fifth = chords['intervals'][:, 8].astype(bool)
no_chord = (chords['intervals'] == NO_CHORD[-1]).all(axis=1)
reduced_chords = chords.copy()
ivs = reduced_chords['intervals']
ivs[~no_chord] = self.interval_list('(1)')
ivs[unison & perf_fifth] = self.interval_list('(1,5)')
ivs[~perf_fourth & maj_sec] = self._shorthands['sus2']
ivs[perf_fourth & ~maj_sec] = self._shorthands['sus4']
ivs[min_third] = self._shorthands['min']
ivs[min_third & aug_fifth & ~perf_fifth] = self.interval_list('(1,b3,#5)')
ivs[min_third & dim_fifth & ~perf_fifth] = self._shorthands['dim']
ivs[maj_third] = self._shorthands['maj']
ivs[maj_third & dim_fifth & ~perf_fifth] = self.interval_list('(1,3,b5)')
ivs[maj_third & aug_fifth & ~perf_fifth] = self._shorthands['aug']
if not keep_bass:
reduced_chords['bass'] = 0
else:
# remove bass notes if they are not part of the intervals anymore
reduced_chords['bass'] *= ivs[range(len(reduced_chords)),
reduced_chords['bass']]
# keep -1 in bass for no chords
reduced_chords['bass'][no_chord] = -1
return reduced_chords
def convert_to_id(self, root, is_major):
if root == -1:
return 24
else:
if is_major:
return root * 2
else:
return root * 2 + 1
def get_converted_chord(self, filename):
loaded_chord = self.load_chords(filename)
triads = self.reduce_to_triads(loaded_chord['chord'])
df = self.assign_chord_id(triads)
df['start'] = loaded_chord['start']
df['end'] = loaded_chord['end']
return df
def assign_chord_id(self, entry):
# maj, min chord only
# if you want to add other chord, change this part and get_converted_chord(reduce_to_triads)
df = pd.DataFrame(data=entry[['root', 'is_major']])
df['chord_id'] = df.apply(lambda row: self.convert_to_id(row['root'], row['is_major']), axis=1)
return df
def convert_to_id_voca(self, root, quality):
if root == -1:
return 169
else:
if quality == 'min':
return root * 14
elif quality == 'maj':
return root * 14 + 1
elif quality == 'dim':
return root * 14 + 2
elif quality == 'aug':
return root * 14 + 3
elif quality == 'min6':
return root * 14 + 4
elif quality == 'maj6':
return root * 14 + 5
elif quality == 'min7':
return root * 14 + 6
elif quality == 'minmaj7':
return root * 14 + 7
elif quality == 'maj7':
return root * 14 + 8
elif quality == '7':
return root * 14 + 9
elif quality == 'dim7':
return root * 14 + 10
elif quality == 'hdim7':
return root * 14 + 11
elif quality == 'sus2':
return root * 14 + 12
elif quality == 'sus4':
return root * 14 + 13
else:
return 168
def get_converted_chord_voca(self, filename):
loaded_chord = self.load_chords(filename)
triads = self.reduce_to_triads(loaded_chord['chord'])
df = pd.DataFrame(data=triads[['root', 'is_major']])
(ref_intervals, ref_labels) = mir_eval.io.load_labeled_intervals(filename)
ref_labels = self.lab_file_error_modify(ref_labels)
idxs = list()
for i in ref_labels:
chord_root, quality, scale_degrees, bass = mir_eval.chord.split(i, reduce_extended_chords=True)
root, bass, ivs, is_major = self.chord(i)
idxs.append(self.convert_to_id_voca(root=root, quality=quality))
df['chord_id'] = idxs
df['start'] = loaded_chord['start']
df['end'] = loaded_chord['end']
return df
def lab_file_error_modify(self, ref_labels):
for i in range(len(ref_labels)):
if ref_labels[i][-2:] == ':4':
ref_labels[i] = ref_labels[i].replace(':4', ':sus4')
elif ref_labels[i][-2:] == ':6':
ref_labels[i] = ref_labels[i].replace(':6', ':maj6')
elif ref_labels[i][-4:] == ':6/2':
ref_labels[i] = ref_labels[i].replace(':6/2', ':maj6/2')
elif ref_labels[i] == 'Emin/4':
ref_labels[i] = 'E:min/4'
elif ref_labels[i] == 'A7/3':
ref_labels[i] = 'A:7/3'
elif ref_labels[i] == 'Bb7/3':
ref_labels[i] = 'Bb:7/3'
elif ref_labels[i] == 'Bb7/5':
ref_labels[i] = 'Bb:7/5'
elif ref_labels[i].find(':') == -1:
if ref_labels[i].find('min') != -1:
ref_labels[i] = ref_labels[i][:ref_labels[i].find('min')] + ':' + ref_labels[i][ref_labels[i].find('min'):]
return ref_labels
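The degree-to-semitone table built from `_l` above is compact but opaque. As a standalone sketch (independent of this repo, reimplementing only the arithmetic of `_chroma_id` and `interval()` for illustration):

```python
import numpy as np

# Sketch of the _chroma_id lookup used by pitch() and interval():
# _l marks the whole steps of the major scale, so the cumulative sum turns
# 1-based diatonic degrees into semitone offsets (0, 2, 4, 5, 7, 9, 11, 12, ...).
_l = [0, 1, 1, 0, 1, 1, 1]
chroma_id = (np.arange(len(_l) * 2) + 1) + np.array(_l + _l).cumsum() - 1

def interval(interval_str):
    # e.g. 'b7' -> 10: degree 7 is 11 semitones above the root, flattened by one
    for i, c in enumerate(interval_str):
        if c.isdigit():
            semitones = int(chroma_id[int(interval_str[i:]) - 1])
            mods = interval_str[:i]
            return (semitones - mods.count('b') + mods.count('#')) % 12

print(interval('3'), interval('b3'), interval('b7'))  # 4 3 10
```

The same table, indexed via `(ord(c) - ord('C')) % 7`, is what lets `pitch()` handle note names starting anywhere in the C major scale.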
================================================
FILE: utils/hparams.py
================================================
import yaml
# TODO: the add() method should be changed
class HParams(object):
# Hyperparameter class using yaml
def __init__(self, **kwargs):
self.__dict__ = kwargs
def add(self, **kwargs):
        # TODO: if the key already exists, this should not overwrite it
self.__dict__.update(kwargs)
def update(self, **kwargs):
self.__dict__.update(kwargs)
return self
def save(self, path):
with open(path, 'w') as f:
yaml.dump(self.__dict__, f)
return self
def __repr__(self):
return '\nHyperparameters:\n' + '\n'.join([' {}={}'.format(k, v) for k, v in self.__dict__.items()])
@classmethod
def load(cls, path):
with open(path, 'r') as f:
            return cls(**yaml.load(f, Loader=yaml.FullLoader))  # PyYAML >= 5.1 requires an explicit Loader
if __name__ == '__main__':
hparams = HParams.load('hparams.yaml')
print(hparams)
d = {"MemoryNetwork": 0, "c": 1}
hparams.add(**d)
print(hparams)
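The attribute-dict pattern above boils down to a small trick: replacing the instance `__dict__` wholesale turns keyword arguments into attributes. A minimal self-contained mirror of that idea (the hyperparameter names here are made up for illustration):

```python
# Minimal mirror of the HParams idea: kwargs become attributes because
# the instance __dict__ is replaced wholesale in __init__.
class HParams(object):
    def __init__(self, **kwargs):
        self.__dict__ = kwargs

    def update(self, **kwargs):
        self.__dict__.update(kwargs)
        return self  # returning self allows chained calls

hp = HParams(lr=1e-4, batch_size=128).update(batch_size=64)
print(hp.lr, hp.batch_size)  # 0.0001 64
```

Because `save()` dumps `self.__dict__` as YAML and `load()` splats the parsed mapping back into the constructor, the round-trip preserves exactly these attributes.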
================================================
FILE: utils/logger.py
================================================
import logging
import os
import sys
import time
project_name = os.path.basename(os.getcwd())  # portable across path separators
_logger = logging.getLogger(project_name)
_logger.addHandler(logging.StreamHandler())
def _log_prefix():
# Returns (filename, line number) for the stack frame.
def _get_file_line():
# pylint: disable=protected-access
# noinspection PyProtectedMember
f = sys._getframe()
# pylint: enable=protected-access
our_file = f.f_code.co_filename
f = f.f_back
while f:
code = f.f_code
if code.co_filename != our_file:
return code.co_filename, f.f_lineno
f = f.f_back
return '<unknown>', 0
# current time
now = time.time()
now_tuple = time.localtime(now)
now_millisecond = int(1e3 * (now % 1.0))
# current filename and line
filename, line = _get_file_line()
basename = os.path.basename(filename)
s = '%02d-%02d %02d:%02d:%02d.%03d %s:%d] ' % (
now_tuple[1], # month
now_tuple[2], # day
now_tuple[3], # hour
now_tuple[4], # min
now_tuple[5], # sec
now_millisecond,
basename,
line)
return s
def logging_verbosity(verbosity=0):
_logger.setLevel(verbosity)
def debug(msg, *args, **kwargs):
_logger.debug('D ' + project_name + ' ' + _log_prefix() + msg, *args, **kwargs)
def info(msg, *args, **kwargs):
_logger.info('I ' + project_name + ' ' + _log_prefix() + msg, *args, **kwargs)
def warn(msg, *args, **kwargs):
_logger.warning('W ' + project_name + ' ' + _log_prefix() + msg, *args, **kwargs)
def error(msg, *args, **kwargs):
_logger.error('E ' + project_name + ' ' + _log_prefix() + msg, *args, **kwargs)
def fatal(msg, *args, **kwargs):
_logger.fatal('F ' + project_name + ' ' + _log_prefix() + msg, *args, **kwargs)
================================================
FILE: utils/mir_eval_modules.py
================================================
import numpy as np
import librosa
import mir_eval
import torch
import os
idx2chord = ['C', 'C:min', 'C#', 'C#:min', 'D', 'D:min', 'D#', 'D#:min', 'E', 'E:min', 'F', 'F:min', 'F#',
'F#:min', 'G', 'G:min', 'G#', 'G#:min', 'A', 'A:min', 'A#', 'A#:min', 'B', 'B:min', 'N']
root_list = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']
quality_list = ['min', 'maj', 'dim', 'aug', 'min6', 'maj6', 'min7', 'minmaj7', 'maj7', '7', 'dim7', 'hdim7', 'sus2', 'sus4']
def idx2voca_chord():
idx2voca_chord = {}
idx2voca_chord[169] = 'N'
idx2voca_chord[168] = 'X'
for i in range(168):
root = i // 14
root = root_list[root]
quality = i % 14
quality = quality_list[quality]
if i % 14 != 1:
chord = root + ':' + quality
else:
chord = root
idx2voca_chord[i] = chord
return idx2voca_chord
def audio_file_to_features(audio_file, config):
original_wav, sr = librosa.load(audio_file, sr=config.mp3['song_hz'], mono=True)
    current_sec_hz = 0  # sample index of the current position in the waveform
    while len(original_wav) > current_sec_hz + config.mp3['song_hz'] * config.mp3['inst_len']:
        start_idx = int(current_sec_hz)
        end_idx = int(current_sec_hz + config.mp3['song_hz'] * config.mp3['inst_len'])
        tmp = librosa.cqt(original_wav[start_idx:end_idx], sr=sr, n_bins=config.feature['n_bins'], bins_per_octave=config.feature['bins_per_octave'], hop_length=config.feature['hop_length'])
        if start_idx == 0:
            feature = tmp
        else:
            feature = np.concatenate((feature, tmp), axis=1)
        current_sec_hz = end_idx
    tmp = librosa.cqt(original_wav[current_sec_hz:], sr=sr, n_bins=config.feature['n_bins'], bins_per_octave=config.feature['bins_per_octave'], hop_length=config.feature['hop_length'])
    feature = np.concatenate((feature, tmp), axis=1)
feature = np.log(np.abs(feature) + 1e-6)
feature_per_second = config.mp3['inst_len'] / config.model['timestep']
song_length_second = len(original_wav)/config.mp3['song_hz']
return feature, feature_per_second, song_length_second
# Collect paths of all audio files in wav or mp3 format
def get_audio_paths(audio_dir):
return [os.path.join(root, fname) for (root, dir_names, file_names) in os.walk(audio_dir, followlinks=True)
for fname in file_names if (fname.lower().endswith('.wav') or fname.lower().endswith('.mp3'))]
class metrics():
def __init__(self):
super(metrics, self).__init__()
self.score_metrics = ['root', 'thirds', 'triads', 'sevenths', 'tetrads', 'majmin', 'mirex']
self.score_list_dict = dict()
for i in self.score_metrics:
self.score_list_dict[i] = list()
self.average_score = dict()
def score(self, metric, gt_path, est_path):
if metric == 'root':
score = self.root_score(gt_path,est_path)
elif metric == 'thirds':
score = self.thirds_score(gt_path,est_path)
elif metric == 'triads':
score = self.triads_score(gt_path,est_path)
elif metric == 'sevenths':
score = self.sevenths_score(gt_path,est_path)
elif metric == 'tetrads':
score = self.tetrads_score(gt_path,est_path)
elif metric == 'majmin':
score = self.majmin_score(gt_path,est_path)
elif metric == 'mirex':
score = self.mirex_score(gt_path,est_path)
else:
raise NotImplementedError
return score
    def _weighted_score(self, gt_path, est_path, comparison_fn):
        # Shared mir_eval pipeline for all metrics: load both lab files,
        # align the estimated intervals to the reference span, merge the
        # segmentations, and compute the duration-weighted accuracy for
        # the given comparison function.
        (ref_intervals, ref_labels) = mir_eval.io.load_labeled_intervals(gt_path)
        ref_labels = lab_file_error_modify(ref_labels)
        (est_intervals, est_labels) = mir_eval.io.load_labeled_intervals(est_path)
        est_intervals, est_labels = mir_eval.util.adjust_intervals(est_intervals, est_labels, ref_intervals.min(),
                                                                   ref_intervals.max(), mir_eval.chord.NO_CHORD,
                                                                   mir_eval.chord.NO_CHORD)
        (intervals, ref_labels, est_labels) = mir_eval.util.merge_labeled_intervals(ref_intervals, ref_labels,
                                                                                    est_intervals, est_labels)
        durations = mir_eval.util.intervals_to_durations(intervals)
        comparisons = comparison_fn(ref_labels, est_labels)
        return mir_eval.chord.weighted_accuracy(comparisons, durations)
    def root_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.root)
    def thirds_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.thirds)
    def triads_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.triads)
    def sevenths_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.sevenths)
    def tetrads_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.tetrads)
    def majmin_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.majmin)
    def mirex_score(self, gt_path, est_path):
        return self._weighted_score(gt_path, est_path, mir_eval.chord.mirex)
def lab_file_error_modify(ref_labels):
for i in range(len(ref_labels)):
if ref_labels[i][-2:] == ':4':
ref_labels[i] = ref_labels[i].replace(':4', ':sus4')
elif ref_labels[i][-2:] == ':6':
ref_labels[i] = ref_labels[i].replace(':6', ':maj6')
elif ref_labels[i][-4:] == ':6/2':
ref_labels[i] = ref_labels[i].replace(':6/2', ':maj6/2')
elif ref_labels[i] == 'Emin/4':
ref_labels[i] = 'E:min/4'
elif ref_labels[i] == 'A7/3':
ref_labels[i] = 'A:7/3'
elif ref_labels[i] == 'Bb7/3':
ref_labels[i] = 'Bb:7/3'
elif ref_labels[i] == 'Bb7/5':
ref_labels[i] = 'Bb:7/5'
elif ref_labels[i].find(':') == -1:
if ref_labels[i].find('min') != -1:
ref_labels[i] = ref_labels[i][:ref_labels[i].find('min')] + ':' + ref_labels[i][ref_labels[i].find('min'):]
return ref_labels
def root_majmin_score_calculation(valid_dataset, config, mean, std, device, model, model_type, verbose=False):
valid_song_names = valid_dataset.song_names
paths = valid_dataset.preprocessor.get_all_files()
metrics_ = metrics()
song_length_list = list()
for path in paths:
song_name, lab_file_path, mp3_file_path, _ = path
        if song_name not in valid_song_names:
continue
try:
n_timestep = config.model['timestep']
feature, feature_per_second, song_length_second = audio_file_to_features(mp3_file_path, config)
feature = feature.T
feature = (feature - mean) / std
time_unit = feature_per_second
num_pad = n_timestep - (feature.shape[0] % n_timestep)
feature = np.pad(feature, ((0, num_pad), (0, 0)), mode="constant", constant_values=0)
num_instance = feature.shape[0] // n_timestep
start_time = 0.0
lines = []
with torch.no_grad():
model.eval()
feature = torch.tensor(feature, dtype=torch.float32).unsqueeze(0).to(device)
for t in range(num_instance):
if model_type == 'btc':
encoder_output, _ = model.self_attn_layers(feature[:, n_timestep * t:n_timestep * (t + 1), :])
prediction, _ = model.output_layer(encoder_output)
prediction = prediction.squeeze()
elif model_type == 'cnn' or model_type =='crnn':
prediction, _, _, _ = model(feature[:, n_timestep * t:n_timestep * (t + 1), :], torch.randint(config.model['num_chords'], (n_timestep,)).to(device))
for i in range(n_timestep):
if t == 0 and i == 0:
prev_chord = prediction[i].item()
continue
if prediction[i].item() != prev_chord:
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2chord[prev_chord]))
start_time = time_unit * (n_timestep * t + i)
prev_chord = prediction[i].item()
if t == num_instance - 1 and i + num_pad == n_timestep:
if start_time != time_unit * (n_timestep * t + i):
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2chord[prev_chord]))
break
pid = os.getpid()
tmp_path = 'tmp_' + str(pid) + '.lab'
with open(tmp_path, 'w') as f:
for line in lines:
f.write(line)
root_majmin = ['root', 'majmin']
for m in root_majmin:
metrics_.score_list_dict[m].append(metrics_.score(metric=m, gt_path=lab_file_path, est_path=tmp_path))
song_length_list.append(song_length_second)
if verbose:
for m in root_majmin:
print('song name %s, %s score : %.4f' % (song_name, m, metrics_.score_list_dict[m][-1]))
        except Exception as e:
            print('song %s: lab file error (%s)' % (song_name, e))
    # weight each song's scores by its duration
    tmp = np.array(song_length_list) / np.sum(song_length_list)
for m in root_majmin:
metrics_.average_score[m] = np.sum(np.multiply(metrics_.score_list_dict[m], tmp))
return metrics_.score_list_dict, song_length_list, metrics_.average_score
def root_majmin_score_calculation_crf(valid_dataset, config, mean, std, device, pre_model, model, model_type, verbose=False):
valid_song_names = valid_dataset.song_names
paths = valid_dataset.preprocessor.get_all_files()
metrics_ = metrics()
song_length_list = list()
for path in paths:
song_name, lab_file_path, mp3_file_path, _ = path
        if song_name not in valid_song_names:
continue
try:
n_timestep = config.model['timestep']
feature, feature_per_second, song_length_second = audio_file_to_features(mp3_file_path, config)
feature = feature.T
feature = (feature - mean) / std
time_unit = feature_per_second
num_pad = n_timestep - (feature.shape[0] % n_timestep)
feature = np.pad(feature, ((0, num_pad), (0, 0)), mode="constant", constant_values=0)
num_instance = feature.shape[0] // n_timestep
start_time = 0.0
lines = []
with torch.no_grad():
model.eval()
feature = torch.tensor(feature, dtype=torch.float32).unsqueeze(0).to(device)
for t in range(num_instance):
if (model_type == 'cnn') or (model_type == 'crnn') or (model_type == 'btc'):
logits = pre_model(feature[:, n_timestep * t:n_timestep * (t + 1), :], torch.randint(config.model['num_chords'], (n_timestep,)).to(device))
prediction, _ = model(logits, torch.randint(config.model['num_chords'], (n_timestep,)).to(device))
else:
raise NotImplementedError
for i in range(n_timestep):
if t == 0 and i == 0:
prev_chord = prediction[i].item()
continue
if prediction[i].item() != prev_chord:
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2chord[prev_chord]))
start_time = time_unit * (n_timestep * t + i)
prev_chord = prediction[i].item()
if t == num_instance - 1 and i + num_pad == n_timestep:
if start_time != time_unit * (n_timestep * t + i):
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2chord[prev_chord]))
break
pid = os.getpid()
tmp_path = 'tmp_' + str(pid) + '.lab'
with open(tmp_path, 'w') as f:
for line in lines:
f.write(line)
root_majmin = ['root', 'majmin']
for m in root_majmin:
metrics_.score_list_dict[m].append(metrics_.score(metric=m, gt_path=lab_file_path, est_path=tmp_path))
song_length_list.append(song_length_second)
if verbose:
for m in root_majmin:
print('song name %s, %s score : %.4f' % (song_name, m, metrics_.score_list_dict[m][-1]))
        except Exception as e:
            print('song %s: lab file error (%s)' % (song_name, e))
    # weight each song's scores by its duration
    tmp = np.array(song_length_list) / np.sum(song_length_list)
for m in root_majmin:
metrics_.average_score[m] = np.sum(np.multiply(metrics_.score_list_dict[m], tmp))
return metrics_.score_list_dict, song_length_list, metrics_.average_score
def large_voca_score_calculation(valid_dataset, config, mean, std, device, model, model_type, verbose=False):
idx2voca = idx2voca_chord()
valid_song_names = valid_dataset.song_names
paths = valid_dataset.preprocessor.get_all_files()
metrics_ = metrics()
song_length_list = list()
for path in paths:
song_name, lab_file_path, mp3_file_path, _ = path
        if song_name not in valid_song_names:
continue
try:
n_timestep = config.model['timestep']
feature, feature_per_second, song_length_second = audio_file_to_features(mp3_file_path, config)
feature = feature.T
feature = (feature - mean) / std
time_unit = feature_per_second
num_pad = n_timestep - (feature.shape[0] % n_timestep)
feature = np.pad(feature, ((0, num_pad), (0, 0)), mode="constant", constant_values=0)
num_instance = feature.shape[0] // n_timestep
start_time = 0.0
lines = []
with torch.no_grad():
model.eval()
feature = torch.tensor(feature, dtype=torch.float32).unsqueeze(0).to(device)
for t in range(num_instance):
if model_type == 'btc':
encoder_output, _ = model.self_attn_layers(feature[:, n_timestep * t:n_timestep * (t + 1), :])
prediction, _ = model.output_layer(encoder_output)
prediction = prediction.squeeze()
elif model_type == 'cnn' or model_type =='crnn':
prediction, _, _, _ = model(feature[:, n_timestep * t:n_timestep * (t + 1), :], torch.randint(config.model['num_chords'], (n_timestep,)).to(device))
for i in range(n_timestep):
if t == 0 and i == 0:
prev_chord = prediction[i].item()
continue
if prediction[i].item() != prev_chord:
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2voca[prev_chord]))
start_time = time_unit * (n_timestep * t + i)
prev_chord = prediction[i].item()
if t == num_instance - 1 and i + num_pad == n_timestep:
if start_time != time_unit * (n_timestep * t + i):
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2voca[prev_chord]))
break
pid = os.getpid()
tmp_path = 'tmp_' + str(pid) + '.lab'
with open(tmp_path, 'w') as f:
for line in lines:
f.write(line)
for m in metrics_.score_metrics:
metrics_.score_list_dict[m].append(metrics_.score(metric=m, gt_path=lab_file_path, est_path=tmp_path))
song_length_list.append(song_length_second)
if verbose:
for m in metrics_.score_metrics:
print('song name %s, %s score : %.4f' % (song_name, m, metrics_.score_list_dict[m][-1]))
        except Exception as e:
            print('song %s: lab file error (%s)' % (song_name, e))
    # weight each song's scores by its duration
    tmp = np.array(song_length_list) / np.sum(song_length_list)
for m in metrics_.score_metrics:
metrics_.average_score[m] = np.sum(np.multiply(metrics_.score_list_dict[m], tmp))
return metrics_.score_list_dict, song_length_list, metrics_.average_score
def large_voca_score_calculation_crf(valid_dataset, config, mean, std, device, pre_model, model, model_type, verbose=False):
idx2voca = idx2voca_chord()
valid_song_names = valid_dataset.song_names
paths = valid_dataset.preprocessor.get_all_files()
metrics_ = metrics()
song_length_list = list()
for path in paths:
song_name, lab_file_path, mp3_file_path, _ = path
        if song_name not in valid_song_names:
continue
try:
n_timestep = config.model['timestep']
feature, feature_per_second, song_length_second = audio_file_to_features(mp3_file_path, config)
feature = feature.T
feature = (feature - mean) / std
time_unit = feature_per_second
num_pad = n_timestep - (feature.shape[0] % n_timestep)
feature = np.pad(feature, ((0, num_pad), (0, 0)), mode="constant", constant_values=0)
num_instance = feature.shape[0] // n_timestep
start_time = 0.0
lines = []
with torch.no_grad():
model.eval()
feature = torch.tensor(feature, dtype=torch.float32).unsqueeze(0).to(device)
for t in range(num_instance):
if (model_type == 'cnn') or (model_type == 'crnn') or (model_type == 'btc'):
logits = pre_model(feature[:, n_timestep * t:n_timestep * (t + 1), :], torch.randint(config.model['num_chords'], (n_timestep,)).to(device))
prediction, _ = model(logits, torch.randint(config.model['num_chords'], (n_timestep,)).to(device))
else:
raise NotImplementedError
for i in range(n_timestep):
if t == 0 and i == 0:
prev_chord = prediction[i].item()
continue
if prediction[i].item() != prev_chord:
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2voca[prev_chord]))
start_time = time_unit * (n_timestep * t + i)
prev_chord = prediction[i].item()
if t == num_instance - 1 and i + num_pad == n_timestep:
if start_time != time_unit * (n_timestep * t + i):
lines.append(
'%.6f %.6f %s\n' % (
start_time, time_unit * (n_timestep * t + i), idx2voca[prev_chord]))
break
pid = os.getpid()
tmp_path = 'tmp_' + str(pid) + '.lab'
with open(tmp_path, 'w') as f:
for line in lines:
f.write(line)
for m in metrics_.score_metrics:
metrics_.score_list_dict[m].append(metrics_.score(metric=m, gt_path=lab_file_path, est_path=tmp_path))
song_length_list.append(song_length_second)
if verbose:
for m in metrics_.score_metrics:
print('song name %s, %s score : %.4f' % (song_name, m, metrics_.score_list_dict[m][-1]))
        except Exception as e:
            print('song %s: lab file error (%s)' % (song_name, e))
    # weight each song's scores by its duration
    tmp = np.array(song_length_list) / np.sum(song_length_list)
for m in metrics_.score_metrics:
metrics_.average_score[m] = np.sum(np.multiply(metrics_.score_list_dict[m], tmp))
return metrics_.score_list_dict, song_length_list, metrics_.average_score
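The final lines of the evaluation above aggregate per-song scores into a single number by weighting each song's score by its duration. A minimal numpy sketch of that aggregation, with hypothetical scores and lengths standing in for `metrics_.score_list_dict[m]` and `song_length_list`:

```python
import numpy as np

# Hypothetical per-song scores and song lengths in seconds; the real values
# come from the evaluation loop above.
scores = np.array([0.80, 0.60, 0.90])
song_lengths = np.array([180.0, 240.0, 120.0])

# Length-weighted average, as computed for metrics_.average_score[m]:
# longer songs contribute proportionally more to the final score.
weights = song_lengths / np.sum(song_lengths)
weighted_score = np.sum(scores * weights)
```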
================================================
FILE: utils/preprocess.py
================================================
import os
import librosa
from utils.chords import Chords
import re
from enum import Enum
import pyrubberband as pyrb
import torch
import math
class FeatureTypes(Enum):
cqt = 'cqt'
class Preprocess():
def __init__(self, config, feature_to_use, dataset_names, root_dir):
self.config = config
self.dataset_names = dataset_names
self.root_path = root_dir + '/'
self.time_interval = config.feature["hop_length"]/config.mp3["song_hz"]
self.no_of_chord_datapoints_per_sequence = math.ceil(config.mp3['inst_len'] / self.time_interval)
self.Chord_class = Chords()
# isophonic
self.isophonic_directory = self.root_path + 'isophonic/'
# uspop
self.uspop_directory = self.root_path + 'uspop/'
self.uspop_audio_path = 'audio/'
self.uspop_lab_path = 'annotations/uspopLabels/'
self.uspop_index_path = 'annotations/uspopLabels.txt'
# robbie williams
self.robbie_williams_directory = self.root_path + 'robbiewilliams/'
self.robbie_williams_audio_path = 'audio/'
self.robbie_williams_lab_path = 'chords/'
self.feature_name = feature_to_use
self.is_cut_last_chord = False
def find_mp3_path(self, dirpath, word):
for filename in os.listdir(dirpath):
last_dir = dirpath.split("/")[-2]
if ".mp3" in filename:
tmp = filename.replace(".mp3", "")
tmp = tmp.replace(last_dir, "")
filename_lower = tmp.lower()
filename_lower = " ".join(re.findall("[a-zA-Z]+", filename_lower))
if word.lower().replace(" ", "") in filename_lower.replace(" ", ""):
return filename
def find_mp3_path_robbiewilliams(self, dirpath, word):
for filename in os.listdir(dirpath):
if ".mp3" in filename:
tmp = filename.replace(".mp3", "")
filename_lower = tmp.lower()
filename_lower = filename_lower.replace("robbie williams", "")
filename_lower = " ".join(re.findall("[a-zA-Z]+", filename_lower))
filename_lower = self.song_pre(filename_lower)
if self.song_pre(word.lower()).replace(" ", "") in filename_lower.replace(" ", ""):
return filename
def get_all_files(self):
res_list = []
# isophonic
if "isophonic" in self.dataset_names:
for dirpath, dirnames, filenames in os.walk(self.isophonic_directory):
if not dirnames:
for filename in filenames:
if ".lab" in filename:
tmp = filename.replace(".lab", "")
song_name = " ".join(re.findall("[a-zA-Z]+", tmp)).replace("CD", "")
mp3_path = self.find_mp3_path(dirpath, song_name)
res_list.append([song_name, os.path.join(dirpath, filename), os.path.join(dirpath, mp3_path),
os.path.join(self.root_path, "result", "isophonic")])
# uspop
if "uspop" in self.dataset_names:
with open(os.path.join(self.uspop_directory, self.uspop_index_path)) as f:
uspop_lab_list = f.readlines()
uspop_lab_list = [x.strip() for x in uspop_lab_list]
for lab_path in uspop_lab_list:
spl = lab_path.split('/')
lab_artist = self.uspop_pre(spl[2])
lab_title = self.uspop_pre(spl[4][3:-4])
lab_path = lab_path.replace('./uspopLabels/', '')
lab_path = os.path.join(self.uspop_directory, self.uspop_lab_path, lab_path)
for filename in os.listdir(os.path.join(self.uspop_directory, self.uspop_audio_path)):
if '.csv' not in filename:
spl = filename.split('-')
mp3_artist = self.uspop_pre(spl[0])
mp3_title = self.uspop_pre(spl[1][:-4])
if lab_artist == mp3_artist and lab_title == mp3_title:
res_list.append([mp3_artist + mp3_title, lab_path,
os.path.join(self.uspop_directory, self.uspop_audio_path, filename),
os.path.join(self.root_path, "result", "uspop")])
break
# robbie williams
if "robbiewilliams" in self.dataset_names:
for dirpath, dirnames, filenames in os.walk(self.robbie_williams_directory):
if not dirnames:
for filename in filenames:
if ".txt" in filename and ('README' not in filename):
tmp = filename.replace(".txt", "")
song_name = " ".join(re.findall("[a-zA-Z]+", tmp)).replace("GTChords", "")
mp3_dir = dirpath.replace("chords", "audio")
mp3_path = self.find_mp3_path_robbiewilliams(mp3_dir, song_name)
res_list.append([song_name, os.path.join(dirpath, filename), os.path.join(mp3_dir, mp3_path),
os.path.join(self.root_path, "result", "robbiewilliams")])
return res_list
def uspop_pre(self, text):
text = text.lower()
text = text.replace('_', '')
text = text.replace(' ', '')
text = " ".join(re.findall("[a-zA-Z]+", text))
return text
def song_pre(self, text):
to_remove = ["'", '`', '(', ')', ' ', '&', 'and', 'And']
for remove in to_remove:
text = text.replace(remove, '')
return text
def config_to_folder(self):
mp3_config = self.config.mp3
feature_config = self.config.feature
mp3_string = "%d_%.1f_%.1f" % \
(mp3_config['song_hz'], mp3_config['inst_len'],
mp3_config['skip_interval'])
feature_string = "%s_%d_%d_%d" % \
(self.feature_name.value, feature_config['n_bins'], feature_config['bins_per_octave'], feature_config['hop_length'])
return mp3_config, feature_config, mp3_string, feature_string
def generate_labels_features_new(self, all_list):
pid = os.getpid()
mp3_config, feature_config, mp3_str, feature_str = self.config_to_folder()
i = 0 # number of songs
j = 0 # number of impossible songs
k = 0 # number of tried songs
total = 0 # number of generated instances
stretch_factors = [1.0]
shift_factors = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
loop_broken = False
for song_name, lab_path, mp3_path, save_path in all_list:
# different song initialization
if loop_broken:
loop_broken = False
i += 1
print(pid, "generating features from ...", os.path.join(mp3_path))
if i % 10 == 0:
print(i, ' th song')
original_wav, sr = librosa.load(os.path.join(mp3_path), sr=mp3_config['song_hz'])
# make result path if not exists
# save_path, mp3_string, feature_string, song_name, aug.pt
result_path = os.path.join(save_path, mp3_str, feature_str, song_name.strip())
if not os.path.exists(result_path):
os.makedirs(result_path)
# calculate result
for stretch_factor in stretch_factors:
if loop_broken:
loop_broken = False
break
for shift_factor in shift_factors:
# for filename
idx = 0
chord_info = self.Chord_class.get_converted_chord(os.path.join(lab_path))
k += 1
# stretch original sound and chord info
x = pyrb.time_stretch(original_wav, sr, stretch_factor)
x = pyrb.pitch_shift(x, sr, shift_factor)
audio_length = x.shape[0]
chord_info['start'] = chord_info['start'] / stretch_factor
chord_info['end'] = chord_info['end'] / stretch_factor
last_sec = chord_info.iloc[-1]['end']
last_sec_hz = int(last_sec * mp3_config['song_hz'])
if audio_length + mp3_config['skip_interval'] < last_sec_hz:
print('loaded song is too short :', song_name)
loop_broken = True
j += 1
break
elif audio_length > last_sec_hz:
x = x[:last_sec_hz]
origin_length = last_sec_hz
origin_length_in_sec = origin_length / mp3_config['song_hz']
current_start_second = 0
# get chord list between current_start_second and current+song_length
while current_start_second + mp3_config['inst_len'] < origin_length_in_sec:
inst_start_sec = current_start_second
curSec = current_start_second
chord_list = []
# extract chord per 1/self.time_interval
while curSec < inst_start_sec + mp3_config['inst_len']:
try:
available_chords = chord_info.loc[(chord_info['start'] <= curSec) & (
chord_info['end'] > curSec + self.time_interval)].copy()
if len(available_chords) == 0:
available_chords = chord_info.loc[((chord_info['start'] >= curSec) & (
chord_info['start'] <= curSec + self.time_interval)) | (
(chord_info['end'] >= curSec) & (
chord_info['end'] <= curSec + self.time_interval))].copy()
if len(available_chords) == 1:
chord = available_chords['chord_id'].iloc[0]
elif len(available_chords) > 1:
max_starts = available_chords.apply(lambda row: max(row['start'], curSec),
axis=1)
available_chords['max_start'] = max_starts
min_ends = available_chords.apply(
lambda row: min(row.end, curSec + self.time_interval), axis=1)
available_chords['min_end'] = min_ends
chords_lengths = available_chords['min_end'] - available_chords['max_start']
available_chords['chord_length'] = chords_lengths
chord = available_chords.loc[available_chords['chord_length'].idxmax()]['chord_id']  # .ix was removed in modern pandas
else:
chord = 24
except Exception as e:
chord = 24
print(e)
print(pid, "no chord")
raise RuntimeError()
finally:
# convert chord by shift factor
if chord != 24:
chord += shift_factor * 2
chord = chord % 24
chord_list.append(chord)
curSec += self.time_interval
if len(chord_list) == self.no_of_chord_datapoints_per_sequence:
try:
sequence_start_time = current_start_second
sequence_end_time = current_start_second + mp3_config['inst_len']
start_index = int(sequence_start_time * mp3_config['song_hz'])
end_index = int(sequence_end_time * mp3_config['song_hz'])
song_seq = x[start_index:end_index]
etc = '%.1f_%.1f' % (
current_start_second, current_start_second + mp3_config['inst_len'])
aug = '%.2f_%i' % (stretch_factor, shift_factor)
if self.feature_name == FeatureTypes.cqt:
# print(pid, "make feature")
feature = librosa.cqt(song_seq, sr=sr, n_bins=feature_config['n_bins'],
bins_per_octave=feature_config['bins_per_octave'],
hop_length=feature_config['hop_length'])
else:
raise NotImplementedError
if feature.shape[1] > self.no_of_chord_datapoints_per_sequence:
feature = feature[:, :self.no_of_chord_datapoints_per_sequence]
if feature.shape[1] != self.no_of_chord_datapoints_per_sequence:
print('loaded features length is too short :', song_name)
loop_broken = True
j += 1
break
result = {
'feature': feature,
'chord': chord_list,
'etc': etc
}
# save_path, mp3_string, feature_string, song_name, aug.pt
filename = aug + "_" + str(idx) + ".pt"
torch.save(result, os.path.join(result_path, filename))
idx += 1
total += 1
except Exception as e:
print(e)
print(pid, "feature error")
raise RuntimeError()
else:
print("invalid number of chord datapoints in sequence :", len(chord_list))
current_start_second += mp3_config['skip_interval']
print(pid, "total instances: %d" % total)
def generate_labels_features_voca(self, all_list):
pid = os.getpid()
mp3_config, feature_config, mp3_str, feature_str = self.config_to_folder()
i = 0 # number of songs
j = 0 # number of impossible songs
k = 0 # number of tried songs
total = 0 # number of generated instances
stretch_factors = [1.0]
shift_factors = [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6]
loop_broken = False
for song_name, lab_path, mp3_path, save_path in all_list:
save_path = save_path + '_voca'
# different song initialization
if loop_broken:
loop_broken = False
i += 1
print(pid, "generating features from ...", os.path.join(mp3_path))
if i % 10 == 0:
print(i, ' th song')
original_wav, sr = librosa.load(os.path.join(mp3_path), sr=mp3_config['song_hz'])
# save_path, mp3_string, feature_string, song_name, aug.pt
result_path = os.path.join(save_path, mp3_str, feature_str, song_name.strip())
if not os.path.exists(result_path):
os.makedirs(result_path)
# calculate result
for stretch_factor in stretch_factors:
if loop_broken:
loop_broken = False
break
for shift_factor in shift_factors:
# for filename
idx = 0
try:
chord_info = self.Chord_class.get_converted_chord_voca(os.path.join(lab_path))
except Exception as e:
print(e)
print(pid, " chord lab file error : %s" % song_name)
loop_broken = True
j += 1
break
k += 1
# stretch original sound and chord info
x = pyrb.time_stretch(original_wav, sr, stretch_factor)
x = pyrb.pitch_shift(x, sr, shift_factor)
audio_length = x.shape[0]
chord_info['start'] = chord_info['start'] / stretch_factor
chord_info['end'] = chord_info['end'] / stretch_factor
last_sec = chord_info.iloc[-1]['end']
last_sec_hz = int(last_sec * mp3_config['song_hz'])
if audio_length + mp3_config['skip_interval'] < last_sec_hz:
print('loaded song is too short :', song_name)
loop_broken = True
j += 1
break
elif audio_length > last_sec_hz:
x = x[:last_sec_hz]
origin_length = last_sec_hz
origin_length_in_sec = origin_length / mp3_config['song_hz']
current_start_second = 0
# get chord list between current_start_second and current+song_length
while current_start_second + mp3_config['inst_len'] < origin_length_in_sec:
inst_start_sec = current_start_second
curSec = current_start_second
chord_list = []
# extract chord per 1/self.time_interval
while curSec < inst_start_sec + mp3_config['inst_len']:
try:
available_chords = chord_info.loc[(chord_info['start'] <= curSec) & (chord_info['end'] > curSec + self.time_interval)].copy()
if len(available_chords) == 0:
available_chords = chord_info.loc[((chord_info['start'] >= curSec) & (chord_info['start'] <= curSec + self.time_interval)) | ((chord_info['end'] >= curSec) & (chord_info['end'] <= curSec + self.time_interval))].copy()
if len(available_chords) == 1:
chord = available_chords['chord_id'].iloc[0]
elif len(available_chords) > 1:
max_starts = available_chords.apply(lambda row: max(row['start'], curSec),axis=1)
available_chords['max_start'] = max_starts
min_ends = available_chords.apply(lambda row: min(row.end, curSec + self.time_interval), axis=1)
available_chords['min_end'] = min_ends
chords_lengths = available_chords['min_end'] - available_chords['max_start']
available_chords['chord_length'] = chords_lengths
chord = available_chords.loc[available_chords['chord_length'].idxmax()]['chord_id']  # .ix was removed in modern pandas
else:
chord = 169
except Exception as e:
chord = 169
print(e)
print(pid, "no chord")
raise RuntimeError()
finally:
# convert chord by shift factor
if chord != 169 and chord != 168:
chord += shift_factor * 14
chord = chord % 168
chord_list.append(chord)
curSec += self.time_interval
if len(chord_list) == self.no_of_chord_datapoints_per_sequence:
try:
sequence_start_time = current_start_second
sequence_end_time = current_start_second + mp3_config['inst_len']
start_index = int(sequence_start_time * mp3_config['song_hz'])
end_index = int(sequence_end_time * mp3_config['song_hz'])
song_seq = x[start_index:end_index]
etc = '%.1f_%.1f' % (
current_start_second, current_start_second + mp3_config['inst_len'])
aug = '%.2f_%i' % (stretch_factor, shift_factor)
if self.feature_name == FeatureTypes.cqt:
feature = librosa.cqt(song_seq, sr=sr, n_bins=feature_config['n_bins'],
bins_per_octave=feature_config['bins_per_octave'],
hop_length=feature_config['hop_length'])
else:
raise NotImplementedError
if feature.shape[1] > self.no_of_chord_datapoints_per_sequence:
feature = feature[:, :self.no_of_chord_datapoints_per_sequence]
if feature.shape[1] != self.no_of_chord_datapoints_per_sequence:
print('loaded features length is too short :', song_name)
loop_broken = True
j += 1
break
result = {
'feature': feature,
'chord': chord_list,
'etc': etc
}
# save_path, mp3_string, feature_string, song_name, aug.pt
filename = aug + "_" + str(idx) + ".pt"
torch.save(result, os.path.join(result_path, filename))
idx += 1
total += 1
except Exception as e:
print(e)
print(pid, "feature error")
raise RuntimeError()
else:
print("invalid number of chord datapoints in sequence :", len(chord_list))
current_start_second += mp3_config['skip_interval']
print(pid, "total instances: %d" % total)
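When a pitch-shifted copy of a song is generated, `generate_labels_features_new` remaps each chord label with `chord += shift_factor * 2; chord %= 24`, keeping id 24 ("no chord") fixed. A standalone sketch of that remapping, assuming the 24-class major/minor layout in which a chord id steps by 2 per semitone of root movement:

```python
# Sketch of the label remapping applied after pyrb.pitch_shift in
# generate_labels_features_new (24-class major/minor vocabulary).
# Assumption: ids advance by 2 per semitone; id 24 means "no chord".
NO_CHORD = 24

def shift_chord_id(chord, shift_factor):
    if chord == NO_CHORD:
        return chord  # "no chord" is never transposed
    return (chord + shift_factor * 2) % 24

# Wrapping around the 24-id circle: id 22 shifted up one semitone.
shifted = shift_chord_id(22, 1)
```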
================================================
FILE: utils/pytorch_utils.py
================================================
import torch
import numpy as np
import os
import math
from utils import logger
use_cuda = torch.cuda.is_available()
# optimization
# reference: http://pytorch.org/docs/master/_modules/torch/optim/lr_scheduler.html#ReduceLROnPlateau
def adjusting_learning_rate(optimizer, factor=.5, min_lr=0.00001):
for i, param_group in enumerate(optimizer.param_groups):
old_lr = float(param_group['lr'])
new_lr = max(old_lr * factor, min_lr)
param_group['lr'] = new_lr
logger.info('adjusting learning rate from %.6f to %.6f' % (old_lr, new_lr))
# model save and loading
def load_model(asset_path, model, optimizer, restore_epoch=0):
checkpoint_path = os.path.join(asset_path, 'model', 'checkpoint_%d.pth.tar' % restore_epoch)
if os.path.isfile(checkpoint_path):
checkpoint = torch.load(checkpoint_path, map_location=lambda storage, loc: storage)
model.load_state_dict(checkpoint['model'])
optimizer.load_state_dict(checkpoint['optimizer'])
current_step = checkpoint['current_step']
logger.info("restore model with %d epoch" % restore_epoch)
else:
logger.info("no checkpoint with %d epoch" % restore_epoch)
current_step = 0
return model, optimizer, current_step
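`adjusting_learning_rate` above halves every parameter group's learning rate down to a floor. A plain-Python sketch of that logic, mimicking `optimizer.param_groups` as a list of dicts so no torch is needed:

```python
# Sketch of adjusting_learning_rate: param_groups is mimicked as a list of
# dicts, the shape torch.optim optimizers expose.
def adjust_lr(param_groups, factor=0.5, min_lr=1e-5):
    for group in param_groups:
        group['lr'] = max(group['lr'] * factor, min_lr)
    return param_groups

groups = [{'lr': 1e-3}]
adjust_lr(groups)                # 1e-3 -> 5e-4
adjust_lr(groups, factor=0.001)  # would fall below the floor, clipped at min_lr
```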
================================================
FILE: utils/tf_logger.py
================================================
import tensorflow as tf
import numpy as np
import scipy.misc
try:
from StringIO import StringIO # Python 2.7
except ImportError:
from io import BytesIO # Python 3.x
class TF_Logger(object):
def __init__(self, log_dir):
"""Create a summary writer logging to log_dir."""
self.writer = tf.summary.FileWriter(log_dir)
def scalar_summary(self, tag, value, step):
"""Log a scalar variable."""
summary = tf.Summary(value=[tf.Summary.Value(tag=tag, simple_value=value)])
self.writer.add_summary(summary, step)
def image_summary(self, tag, images, step):
"""Log a list of images."""
img_summaries = []
for i, img in enumerate(images):
# Write the image to a string
try:
s = StringIO()
except NameError:  # Python 3: StringIO was not imported
s = BytesIO()
scipy.misc.toimage(img).save(s, format="png")
# Create an Image object
img_sum = tf.Summary.Image(encoded_image_string=s.getvalue(),
height=img.shape[0],
width=img.shape[1])
# Create a Summary value
img_summaries.append(tf.Summary.Value(tag='%s/%d' % (tag, i), image=img_sum))
# Create and write Summary
summary = tf.Summary(value=img_summaries)
self.writer.add_summary(summary, step)
def histo_summary(self, tag, values, step, bins=1000):
"""Log a histogram of the tensor of values."""
# Create a histogram using numpy
counts, bin_edges = np.histogram(values, bins=bins)
# Fill the fields of the histogram proto
hist = tf.HistogramProto()
hist.min = float(np.min(values))
hist.max = float(np.max(values))
hist.num = int(np.prod(values.shape))
hist.sum = float(np.sum(values))
hist.sum_squares = float(np.sum(values ** 2))
# Drop the start of the first bin
bin_edges = bin_edges[1:]
# Add bin edges and counts
for edge in bin_edges:
hist.bucket_limit.append(edge)
for c in counts:
hist.bucket.append(c)
# Create and write Summary
summary = tf.Summary(value=[tf.Summary.Value(tag=tag, histo=hist)])
self.writer.add_summary(summary, step)
self.writer.flush()
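`histo_summary` above drops the first bin edge because `np.histogram` returns `len(counts) + 1` edges while `HistogramProto` wants one `bucket_limit` per bucket. A small numpy sketch of that bucket construction:

```python
import numpy as np

# np.histogram returns one more edge than there are counts; the histogram
# proto wants exactly one right-hand limit per bucket, so the leftmost
# edge is dropped (bin_edges[1:]), as in histo_summary.
values = np.arange(10)
counts, bin_edges = np.histogram(values, bins=5)
bucket_limits = bin_edges[1:]
```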
================================================
FILE: utils/transformer_modules.py
================================================
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import math
def _gen_bias_mask(max_length):
"""
Generates bias values (-Inf) to mask future timesteps during attention
"""
np_mask = np.triu(np.full([max_length, max_length], -np.inf), 1)
torch_mask = torch.from_numpy(np_mask).type(torch.FloatTensor)
return torch_mask.unsqueeze(0).unsqueeze(1)
def _gen_timing_signal(length, channels, min_timescale=1.0, max_timescale=1.0e4):
"""
Generates a [1, length, channels] timing signal consisting of sinusoids
Adapted from:
https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
"""
position = np.arange(length)
num_timescales = channels // 2
log_timescale_increment = (
math.log(float(max_timescale) / float(min_timescale)) /
(float(num_timescales) - 1))
inv_timescales = min_timescale * np.exp(
np.arange(num_timescales).astype(float) * -log_timescale_increment)  # np.float alias was removed in NumPy 1.24
scaled_time = np.expand_dims(position, 1) * np.expand_dims(inv_timescales, 0)
signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
signal = np.pad(signal, [[0, 0], [0, channels % 2]],
'constant', constant_values=[0.0, 0.0])
signal = signal.reshape([1, length, channels])
return torch.from_numpy(signal).type(torch.FloatTensor)
class LayerNorm(nn.Module):
# Borrowed from jekbradbury
# https://github.com/pytorch/pytorch/issues/1959
def __init__(self, features, eps=1e-6):
super(LayerNorm, self).__init__()
self.gamma = nn.Parameter(torch.ones(features))
self.beta = nn.Parameter(torch.zeros(features))
self.eps = eps
def forward(self, x):
mean = x.mean(-1, keepdim=True)
std = x.std(-1, keepdim=True)
return self.gamma * (x - mean) / (std + self.eps) + self.beta
class OutputLayer(nn.Module):
"""
Abstract base class for output layer.
Handles projection to output labels
"""
def __init__(self, hidden_size, output_size, probs_out=False):
super(OutputLayer, self).__init__()
self.output_size = output_size
self.output_projection = nn.Linear(hidden_size, output_size)
self.probs_out = probs_out
self.lstm = nn.LSTM(input_size=hidden_size, hidden_size=int(hidden_size/2), batch_first=True, bidirectional=True)
self.hidden_size = hidden_size
def loss(self, hidden, labels):
raise NotImplementedError('Must implement {}.loss'.format(self.__class__.__name__))
class SoftmaxOutputLayer(OutputLayer):
"""
Implements a softmax based output layer
"""
def forward(self, hidden):
logits = self.output_projection(hidden)
probs = F.softmax(logits, -1)
# _, predictions = torch.max(probs, dim=-1)
topk, indices = torch.topk(probs, 2)
predictions = indices[:,:,0]
second = indices[:,:,1]
if self.probs_out is True:
return logits
# return probs
return predictions, second
def loss(self, hidden, labels):
logits = self.output_projection(hidden)
log_probs = F.log_softmax(logits, -1)
return F.nll_loss(log_probs.view(-1, self.output_size), labels.view(-1))
class MultiHeadAttention(nn.Module):
"""
Multi-head attention as per https://arxiv.org/pdf/1706.03762.pdf
Refer Figure 2
"""
def __init__(self, input_depth, total_key_depth, total_value_depth, output_depth,
num_heads, bias_mask=None, dropout=0.0, attention_map=False):
"""
Parameters:
input_depth: Size of last dimension of input
total_key_depth: Size of last dimension of keys. Must be divisible by num_head
total_value_depth: Size of last dimension of values. Must be divisible by num_head
output_depth: Size last dimension of the final output
num_heads: Number of attention heads
bias_mask: Masking tensor to prevent connections to future elements
dropout: Dropout probability (Should be non-zero only during training)
"""
super(MultiHeadAttention, self).__init__()
# Checks borrowed from
# https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/layers/common_attention.py
if total_key_depth % num_heads != 0:
raise ValueError("Key depth (%d) must be divisible by the number of "
"attention heads (%d)." % (total_key_depth, num_heads))
if total_value_depth % num_heads != 0:
raise ValueError("Value depth (%d) must be divisible by the number of "
"attention heads (%d)." % (total_value_depth, num_heads))
self.attention_map = attention_map
self.num_heads = num_heads
self.query_scale = (total_key_depth // num_heads) ** -0.5
self.bias_mask = bias_mask
# Key and query depth will be same
self.query_linear = nn.Linear(input_depth, total_key_depth, bias=False)
self.key_linear = nn.Linear(input_depth, total_key_depth, bias=False)
self.value_linear = nn.Linear(input_depth, total_value_depth, bias=False)
self.output_linear = nn.Linear(total_value_depth, output_depth, bias=False)
self.dropout = nn.Dropout(dropout)
def _split_heads(self, x):
"""
Split x such to add an extra num_heads dimension
Input:
x: a Tensor with shape [batch_size, seq_length, depth]
Returns:
A Tensor with shape [batch_size, num_heads, seq_length, depth/num_heads]
"""
if len(x.shape) != 3:
raise ValueError("x must have rank 3")
shape = x.shape
return x.view(shape[0], shape[1], self.num_heads, shape[2] // self.num_heads).permute(0, 2, 1, 3)
def _merge_heads(self, x):
"""
Merge the extra num_heads into the last dimension
Input:
x: a Tensor with shape [batch_size, num_heads, seq_length, depth/num_heads]
Returns:
A Tensor with shape [batch_size, seq_length, depth]
"""
if len(x.shape) != 4:
raise ValueError("x must have rank 4")
shape = x.shape
return x.permute(0, 2, 1, 3).contiguous().view(shape[0], shape[2], shape[3] * self.num_heads)
def forward(self, queries, keys, values):
# Do a linear for each component
queries = self.query_linear(queries)
keys = self.key_linear(keys)
values = self.value_linear(values)
# Split into multiple heads
queries = self._split_heads(queries)
keys = self._split_heads(keys)
values = self._split_heads(values)
# Scale queries
queries *= self.query_scale
# Combine queries and keys
logits = torch.matmul(queries, keys.permute(0, 1, 3, 2))
# Add bias to mask future values
if self.bias_mask is not None:
logits += self.bias_mask[:, :, :logits.shape[-2], :logits.shape[-1]].type_as(logits.data)
# Convert to probabilites
weights = nn.functional.softmax(logits, dim=-1)
# Dropout
weights = self.dropout(weights)
# Combine with values to get context
contexts = torch.matmul(weights, values)
# Merge heads
contexts = self._merge_heads(contexts)
# contexts = torch.tanh(contexts)
# Linear to get output
outputs = self.output_linear(contexts)
if self.attention_map is True:
return outputs, weights
return outputs
class Conv(nn.Module):
"""
Convenience class that does padding and convolution for inputs in the format
[batch_size, sequence length, hidden size]
"""
def __init__(self, input_size, output_size, kernel_size, pad_type):
"""
Parameters:
input_size: Input feature size
output_size: Output feature size
kernel_size: Kernel width
pad_type: left -> pad on the left side (to mask future data_loader),
both -> pad on both sides
"""
super(Conv, self).__init__()
padding = (kernel_size - 1, 0) if pad_type == 'left' else (kernel_size // 2, (kernel_size - 1) // 2)
self.pad = nn.ConstantPad1d(padding, 0)
self.conv = nn.Conv1d(input_size, output_size, kernel_size=kernel_size, padding=0)
def forward(self, inputs):
inputs = self.pad(inputs.permute(0, 2, 1))
outputs = self.conv(inputs).permute(0, 2, 1)
return outputs
class PositionwiseFeedForward(nn.Module):
"""
Does a Linear + RELU + Linear on each of the timesteps
"""
def __init__(self, input_depth, filter_size, output_depth, layer_config='ll', padding='left', dropout=0.0):
"""
Parameters:
input_depth: Size of last dimension of input
filter_size: Hidden size of the middle layer
output_depth: Size last dimension of the final output
layer_config: ll -> linear + ReLU + linear
cc -> conv + ReLU + conv etc.
padding: left -> pad on the left side (to mask future data_loader),
both -> pad on both sides
dropout: Dropout probability (Should be non-zero only during training)
"""
super(PositionwiseFeedForward, self).__init__()
layers = []
sizes = ([(input_depth, filter_size)] +
[(filter_size, filter_size)] * (len(layer_config) - 2) +
[(filter_size, output_depth)])
for lc, s in zip(list(layer_config), sizes):
if lc == 'l':
layers.append(nn.Linear(*s))
elif lc == 'c':
layers.append(Conv(*s, kernel_size=3, pad_type=padding))
else:
raise ValueError("Unknown layer type {}".format(lc))
self.layers = nn.ModuleList(layers)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(dropout)
def forward(self, inputs):
x = inputs
for i, layer in enumerate(self.layers):
x = layer(x)
if i < len(self.layers) - 1:  # the original condition was always true; skip activation/dropout after the final layer
x = self.relu(x)
x = self.dropout(x)
return x
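The two module-level helpers at the top of this file, `_gen_bias_mask` and `_gen_timing_signal`, can be checked shape-by-shape in plain numpy. A sketch mirroring both (the real functions additionally convert to torch tensors), using small hypothetical sizes:

```python
import numpy as np

# Numpy sketch of _gen_bias_mask and _gen_timing_signal; hypothetical
# sizes, min_timescale=1.0 and max_timescale=1e4 as in the defaults above.
max_length, channels = 4, 8

# Upper-triangular -inf mask: position i may not attend to any j > i.
mask = np.triu(np.full([max_length, max_length], -np.inf), 1)

# Sinusoidal timing signal: one sin/cos pair per geometric timescale.
position = np.arange(max_length)
num_timescales = channels // 2
log_inc = np.log(1.0e4) / (num_timescales - 1)
inv_timescales = np.exp(np.arange(num_timescales).astype(float) * -log_inc)
scaled_time = position[:, None] * inv_timescales[None, :]
signal = np.concatenate([np.sin(scaled_time), np.cos(scaled_time)], axis=1)
signal = signal.reshape([1, max_length, channels])  # [1, length, channels]
```

At position 0 the sin half is all zeros and the cos half all ones, which is how the first timestep is encoded.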
SYMBOL INDEX (130 symbols across 12 files)
FILE: audio_dataset.py
class AudioDataset (line 10) | class AudioDataset(Dataset):
method __init__ (line 11) | def __init__(self, config, root_dir='/data/music/chord_recognition', d...
method __len__ (line 75) | def __len__(self):
method __getitem__ (line 78) | def __getitem__(self, idx):
method get_paths (line 87) | def get_paths(self, kfold=4):
method get_paths_voca (line 141) | def get_paths_voca(self, kfold=4):
function _collate_fn (line 195) | def _collate_fn(batch):
class AudioDataLoader (line 226) | class AudioDataLoader(DataLoader):
method __init__ (line 227) | def __init__(self, *args, **kwargs):
FILE: baseline_models.py
class CNN (line 10) | class CNN(nn.Module):
method __init__ (line 11) | def __init__(self,config):
method cnn_layers (line 31) | def cnn_layers(self, in_channels, out_channels, kernel_size, stride=1,...
method forward (line 39) | def forward(self, x, labels):
class Crf (line 79) | class Crf(nn.Module):
method __init__ (line 80) | def __init__(self, num_chords, timestep):
method forward (line 86) | def forward(self, probs, labels):
class CRNN (line 93) | class CRNN(nn.Module):
method __init__ (line 94) | def __init__(self,config):
method forward (line 110) | def forward(self, x, labels):
FILE: btc_model.py
class self_attention_block (line 7) | class self_attention_block(nn.Module):
method __init__ (line 8) | def __init__(self, hidden_size, total_key_depth, total_value_depth, fi...
method forward (line 19) | def forward(self, inputs):
class bi_directional_self_attention (line 47) | class bi_directional_self_attention(nn.Module):
method __init__ (line 48) | def __init__(self, hidden_size, total_key_depth, total_value_depth, fi...
method forward (line 83) | def forward(self, inputs):
class bi_directional_self_attention_layers (line 100) | class bi_directional_self_attention_layers(nn.Module):
method __init__ (line 101) | def __init__(self, embedding_size, hidden_size, num_layers, num_heads,...
method forward (line 121) | def forward(self, inputs):
class BTC_model (line 138) | class BTC_model(nn.Module):
method __init__ (line 139) | def __init__(self, config):
method forward (line 161) | def forward(self, x, labels):
FILE: crf_model.py
class CRF (line 8) | class CRF(nn.Module):
method __init__ (line 14) | def __init__(self, num_tags):
method forward (line 24) | def forward(self, feats):
method loss (line 31) | def loss(self, feats, tags):
method _sequence_score (line 61) | def _sequence_score(self, feats, tags):
method _partition_function (line 89) | def _partition_function(self, feats):
method _viterbi (line 113) | def _viterbi(self, feats):
method _log_sum_exp (line 147) | def _log_sum_exp(self, logits, dim):
FILE: utils/chords.py
function idx_to_chord (line 52) | def idx_to_chord(idx):
class Chords (line 63) | class Chords:
method __init__ (line 65) | def __init__(self):
method chords (line 95) | def chords(self, labels):
method label_error_modify (line 124) | def label_error_modify(self, label):
method chord (line 134) | def chord(self, label):
method modify (line 199) | def modify(self, base_pitch, modifier):
method pitch (line 228) | def pitch(self, pitch_str):
method interval (line 247) | def interval(self, interval_str):
method interval_list (line 269) | def interval_list(self, intervals_str, given_pitch_classes=None):
method chord_intervals (line 301) | def chord_intervals(self, quality_str):
method load_chords (line 328) | def load_chords(self, filename):
method reduce_to_triads (line 377) | def reduce_to_triads(self, chords, keep_bass=False):
method convert_to_id (line 442) | def convert_to_id(self, root, is_major):
method get_converted_chord (line 451) | def get_converted_chord(self, filename):
method assign_chord_id (line 461) | def assign_chord_id(self, entry):
method convert_to_id_voca (line 468) | def convert_to_id_voca(self, root, quality):
method get_converted_chord_voca (line 503) | def get_converted_chord_voca(self, filename):
method lab_file_error_modify (line 522) | def lab_file_error_modify(self, ref_labels):
FILE: utils/hparams.py
class HParams (line 5) | class HParams(object):
method __init__ (line 7) | def __init__(self, **kwargs):
method add (line 10) | def add(self, **kwargs):
method update (line 14) | def update(self, **kwargs):
method save (line 18) | def save(self, path):
method __repr__ (line 23) | def __repr__(self):
method load (line 27) | def load(cls, path):
FILE: utils/logger.py
function _log_prefix (line 11) | def _log_prefix():
function logging_verbosity (line 51) | def logging_verbosity(verbosity=0):
function debug (line 55) | def debug(msg, *args, **kwargs):
function info (line 59) | def info(msg, *args, **kwargs):
function warn (line 63) | def warn(msg, *args, **kwargs):
function error (line 67) | def error(msg, *args, **kwargs):
function fatal (line 71) | def fatal(msg, *args, **kwargs):
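`utils/logger.py` wraps a module-level `logging` logger behind `debug`/`info`/`warn`/`error`/`fatal` helpers and prefixes each record (see `_log_prefix`). A loose sketch of that setup; the exact prefix format in the repo may differ:

```python
import logging
import sys

def make_logger(name):
    """Logger writing 'L time file:line] message'-style records to stdout.

    Loosely modeled on utils/logger.py; the handler is added only once so
    repeated calls do not duplicate output.
    """
    log = logging.getLogger(name)
    if not log.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter(
            '%(levelname).1s %(asctime)s %(filename)s:%(lineno)d] %(message)s'))
        log.addHandler(handler)
        log.setLevel(logging.INFO)
    return log
```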
FILE: utils/mir_eval_modules.py
function idx2voca_chord (line 13) | def idx2voca_chord():
function audio_file_to_features (line 29) | def audio_file_to_features(audio_file, config):
function get_audio_paths (line 49) | def get_audio_paths(audio_dir):
class metrics (line 53) | class metrics():
method __init__ (line 54) | def __init__(self):
method score (line 62) | def score(self, metric, gt_path, est_path):
method root_score (line 81) | def root_score(self, gt_path, est_path):
method thirds_score (line 95) | def thirds_score(self, gt_path, est_path):
method triads_score (line 109) | def triads_score(self, gt_path, est_path):
method sevenths_score (line 123) | def sevenths_score(self, gt_path, est_path):
method tetrads_score (line 137) | def tetrads_score(self, gt_path, est_path):
method majmin_score (line 151) | def majmin_score(self, gt_path, est_path):
method mirex_score (line 165) | def mirex_score(self, gt_path, est_path):
function lab_file_error_modify (line 179) | def lab_file_error_modify(ref_labels):
function root_majmin_score_calculation (line 200) | def root_majmin_score_calculation(valid_dataset, config, mean, std, devi...
function root_majmin_score_calculation_crf (line 271) | def root_majmin_score_calculation_crf(valid_dataset, config, mean, std, ...
function large_voca_score_calculation (line 342) | def large_voca_score_calculation(valid_dataset, config, mean, std, devic...
function large_voca_score_calculation_crf (line 413) | def large_voca_score_calculation_crf(valid_dataset, config, mean, std, d...
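The file's module-level `idx2chord` list maps class indices to `mir_eval`-style labels and, per the preview, begins `['C', 'C:min', 'C#', 'C#:min', 'D', ...]`: roots ascend chromatically with major before minor. A sketch that rebuilds a 25-class vocabulary consistent with that opening; placing the no-chord symbol `N` at the end is an assumption about the tail of the repo's list:

```python
# Chromatic roots in sharp spelling, matching the preview's 'C#' entries.
ROOTS = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def build_majmin_vocab():
    """24 major/minor chord labels plus a no-chord class (25 total)."""
    vocab = []
    for root in ROOTS:
        vocab.append(root)            # major chord, e.g. 'C'
        vocab.append(root + ':min')   # minor chord, e.g. 'C:min'
    vocab.append('N')                 # no-chord (assumed tail position)
    return vocab
```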
FILE: utils/preprocess.py
class FeatureTypes (line 10) | class FeatureTypes(Enum):
class Preprocess (line 13) | class Preprocess():
method __init__ (line 14) | def __init__(self, config, feature_to_use, dataset_names, root_dir):
method find_mp3_path (line 40) | def find_mp3_path(self, dirpath, word):
method find_mp3_path_robbiewilliams (line 51) | def find_mp3_path_robbiewilliams(self, dirpath, word):
method get_all_files (line 62) | def get_all_files(self):
method uspop_pre (line 116) | def uspop_pre(self, text):
method song_pre (line 123) | def song_pre(self, text):
method config_to_folder (line 131) | def config_to_folder(self):
method generate_labels_features_new (line 142) | def generate_labels_features_new(self, all_list):
method generate_labels_features_voca (line 305) | def generate_labels_features_voca(self, all_list):
FILE: utils/pytorch_utils.py
function adjusting_learning_rate (line 13) | def adjusting_learning_rate(optimizer, factor=.5, min_lr=0.00001):
function load_model (line 22) | def load_model(asset_path, model, optimizer, restore_epoch=0):
FILE: utils/tf_logger.py
class TF_Logger (line 11) | class TF_Logger(object):
method __init__ (line 12) | def __init__(self, log_dir):
method scalar_summary (line 16) | def scalar_summary(self, tag, value, step):
method image_summary (line 21) | def image_summary(self, tag, images, step):
method histo_summary (line 44) | def histo_summary(self, tag, values, step, bins=1000):
FILE: utils/transformer_modules.py
function _gen_bias_mask (line 10) | def _gen_bias_mask(max_length):
function _gen_timing_signal (line 18) | def _gen_timing_signal(length, channels, min_timescale=1.0, max_timescal...
class LayerNorm (line 40) | class LayerNorm(nn.Module):
method __init__ (line 43) | def __init__(self, features, eps=1e-6):
method forward (line 49) | def forward(self, x):
class OutputLayer (line 54) | class OutputLayer(nn.Module):
method __init__ (line 59) | def __init__(self, hidden_size, output_size, probs_out=False):
method loss (line 67) | def loss(self, hidden, labels):
class SoftmaxOutputLayer (line 70) | class SoftmaxOutputLayer(OutputLayer):
method forward (line 74) | def forward(self, hidden):
method loss (line 86) | def loss(self, hidden, labels):
class MultiHeadAttention (line 91) | class MultiHeadAttention(nn.Module):
method __init__ (line 97) | def __init__(self, input_depth, total_key_depth, total_value_depth, ou...
method _split_heads (line 133) | def _split_heads(self, x):
method _merge_heads (line 146) | def _merge_heads(self, x):
method forward (line 159) | def forward(self, queries, keys, values):
class Conv (line 203) | class Conv(nn.Module):
method __init__ (line 209) | def __init__(self, input_size, output_size, kernel_size, pad_type):
method forward (line 223) | def forward(self, inputs):
class PositionwiseFeedForward (line 230) | class PositionwiseFeedForward(nn.Module):
method __init__ (line 235) | def __init__(self, input_depth, filter_size, output_depth, layer_confi...
method forward (line 266) | def forward(self, inputs):
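`_gen_bias_mask(max_length)` is the usual directional attention mask: additive `-inf` entries that stop position i from attending past itself, which is how BTC restricts each of its two self-attention directions. A pure-Python sketch of the causal form; the repo's tensor version, and how it is flipped for the backward direction, may differ in detail:

```python
def gen_bias_mask(max_length):
    """max_length x max_length additive mask: 0 on and below the diagonal,
    -inf above it, so adding it to attention logits forbids position i
    from attending to any position j > i.
    """
    neg_inf = float('-inf')
    return [[0.0 if j <= i else neg_inf for j in range(max_length)]
            for i in range(max_length)]
```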
Condensed preview — 21 files, each entry giving the file path, character count, and a short content snippet; the full structured content runs to roughly 156K characters.
[
{
"path": "LICENSE",
"chars": 1070,
"preview": "MIT License\n\nCopyright (c) 2019 Jonggwon Park\n\nPermission is hereby granted, free of charge, to any person obtaining a c"
},
{
"path": "README.md",
"chars": 2655,
"preview": "# A Bi-Directional Transformer for Musical Chord Recognition\n\nThis repository has the source codes for the paper \"A Bi-D"
},
{
"path": "audio_dataset.py",
"chars": 9662,
"preview": "import numpy as np\nimport os\nimport torch\nfrom torch.utils.data import Dataset, DataLoader\nfrom utils.preprocess import "
},
{
"path": "baseline_models.py",
"chars": 6082,
"preview": "from utils.hparams import HParams\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport time\nfrom cr"
},
{
"path": "btc_model.py",
"chars": 7361,
"preview": "from utils.transformer_modules import *\nfrom utils.transformer_modules import _gen_timing_signal, _gen_bias_mask\nfrom ut"
},
{
"path": "crf_model.py",
"chars": 5618,
"preview": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\n\nimport tor"
},
{
"path": "run_config.yaml",
"chars": 791,
"preview": "mp3:\n song_hz: 22050\n inst_len: 10.0\n skip_interval: 5.0\n\nfeature:\n n_bins: 144\n bins_per_octave: 24\n hop_length: "
},
{
"path": "test.py",
"chars": 5042,
"preview": "import os\nimport mir_eval\nimport pretty_midi as pm\nfrom utils import logger\nfrom btc_model import *\nfrom utils.mir_eval_"
},
{
"path": "train.py",
"chars": 13719,
"preview": "import os\nfrom torch import optim\nfrom utils import logger\nfrom audio_dataset import AudioDataset, AudioDataLoader\nfrom "
},
{
"path": "train_crf.py",
"chars": 14591,
"preview": "import os\nfrom torch import optim\nfrom utils import logger\nfrom audio_dataset import AudioDataset, AudioDataLoader\nfrom "
},
{
"path": "utils/__init__.py",
"chars": 0,
"preview": ""
},
{
"path": "utils/chords.py",
"chars": 18307,
"preview": "# encoding: utf-8\n\"\"\"\nThis module contains chord evaluation functionality.\n\nIt provides the evaluation measures used for"
},
{
"path": "utils/hparams.py",
"chars": 940,
"preview": "import yaml\n\n\n# TODO: add function should be changed\nclass HParams(object):\n # Hyperparameter class using yaml\n de"
},
{
"path": "utils/logger.py",
"chars": 1869,
"preview": "import logging\nimport os\nimport sys\nimport time\n\n\nproject_name = os.getcwd().split('/')[-1]\n_logger = logging.getLogger("
},
{
"path": "utils/mir_eval_modules.py",
"chars": 26311,
"preview": "import numpy as np\nimport librosa\nimport mir_eval\nimport torch\nimport os\n\nidx2chord = ['C', 'C:min', 'C#', 'C#:min', 'D'"
},
{
"path": "utils/preprocess.py",
"chars": 23371,
"preview": "import os\nimport librosa\nfrom utils.chords import Chords\nimport re\nfrom enum import Enum\nimport pyrubberband as pyrb\nimp"
},
{
"path": "utils/pytorch_utils.py",
"chars": 1283,
"preview": "\nimport torch\nimport numpy as np\nimport os\nimport math\nfrom utils import logger\n\nuse_cuda = torch.cuda.is_available()\n\n\n"
},
{
"path": "utils/tf_logger.py",
"chars": 2364,
"preview": "import tensorflow as tf\nimport numpy as np\nimport scipy.misc\n\ntry:\n from StringIO import StringIO # Python 2.7\nexcep"
},
{
"path": "utils/transformer_modules.py",
"chars": 10480,
"preview": "from __future__ import absolute_import\nfrom __future__ import division\nfrom __future__ import print_function\nimport torc"
}
]
// ... and 2 more files (the binary checkpoints test/btc_model.pt and test/btc_model_large_voca.pt, omitted from this preview)
About this extraction
This page contains the full source code of the jayg996/BTC-ISMIR19 GitHub repository, extracted as plain text: 21 files (23.4 MB, approximately 35.8k tokens) plus a symbol index of 130 extracted functions, classes, methods, constants, and types.