Repository: francesclluis/source-separation-wavenet
Branch: master
Commit: c80bb531f32d
Files: 16
Total size: 76.7 MB

Directory structure:
gitextract_t_rsro8e/
├── LICENSE
├── README.md
├── config.md
├── config_multi_instrument.json
├── config_singing_voice.json
├── datasets.py
├── environment.yml
├── layers.py
├── main.py
├── models.py
├── separate.py
├── sessions/
│   ├── multi-instrument/
│   │   ├── checkpoints/
│   │   │   └── checkpoint.00045-0.hdf5
│   │   └── config.json
│   └── singing-voice/
│       ├── checkpoints/
│       │   └── checkpoint.00058-0.hdf5
│       └── config.json
└── util.py

================================================
FILE CONTENTS
================================================

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Francesc Lluís Salvadó

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
A Wavenet for Music Source Separation
====

A neural network for end-to-end music source separation, as described in [End-to-end music source separation: is it possible in the waveform domain?](https://arxiv.org/abs/1810.12187)

Listen to separated samples [here](http://jordipons.me/apps/end-to-end-music-source-separation/)

What is a Wavenet for Music Source Separation?
-----

The Wavenet for Music Source Separation is a fully convolutional neural network that operates directly on the raw audio waveform. It is an adaptation of [Wavenet](https://deepmind.com/blog/wavenet-generative-model-raw-audio/) that turns the original causal model (generative and slow) into a non-causal model (discriminative and parallelizable). This idea was originally proposed by [Rethage et al.](https://arxiv.org/abs/1706.07162) for speech denoising, and it is adapted here for monaural music source separation. Their [code](https://github.com/drethage/speech-denoising-wavenet) is reused.

The main difference between the original Wavenet and the non-causal adaptation used here is that some samples from the future can be used to predict the present one. As a result of removing the autoregressive causal nature of the original Wavenet, this fully convolutional model is able to predict a target field instead of one sample at a time – due to this parallelization, it is possible to run the model in real-time on a GPU.

See the diagram below for a summary of the network architecture.
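As a rough orientation on what "target field" means here: each output sample depends on a fixed receptive field of input samples, and the model emits a whole window of output samples per forward pass. `util.py` is not reproduced in this dump, so the helper below is an assumption modeled on the speech-denoising-wavenet code this project reuses:

```python
# Sketch (assumed, not the repo's util.py): receptive field of the stacked
# dilated convolutions, matching the call signature used in models.py.
def compute_receptive_field_length(stacks, dilations, filter_length, target_field_length):
    half_filter_length = (filter_length - 1) / 2
    length = stacks * 2 * sum(d * half_filter_length for d in dilations)
    return length + target_field_length

dilations = [2 ** i for i in range(10)]  # 1, 2, ..., 512 (config: dilations = 9)
print(compute_receptive_field_length(4, dilations, 3, 1))  # 8185 samples, ~0.5 s at 16 kHz
```

Under these assumptions, the shipped configs (dilations = 9, num_stacks = 4, kernel length 3) give a receptive field of 8185 samples, so each 1601-sample target field is predicted from 8185 + 1601 - 1 = 9785 input samples (the `input_length` relation in models.py).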
Installation
-----

1. `git clone https://github.com/francesclluis/source-separation-wavenet.git`
2. Install [conda](https://conda.io/docs/user-guide/install/index.html)
3. `conda env create -f environment.yml`
4. `source activate sswavenet`

*Currently the project requires **Keras 2.1** and **Theano 1.0.1**; the large dilations present in the architecture are not supported by the current version of TensorFlow.*

Usage
-----

A pre-trained multi-instrument model (the best-performing model described in the paper) can be found in `sessions/multi-instrument/checkpoints` and is ready to be used out-of-the-box. The parameterization of this model is specified in `sessions/multi-instrument/config.json`.

A pre-trained singing-voice model (the best-performing model described in the paper) can be found in `sessions/singing-voice/checkpoints` and is ready to be used out-of-the-box. The parameterization of this model is specified in `sessions/singing-voice/config.json`.

*Download the dataset as described [below](https://github.com/francesclluis/source-separation-wavenet#dataset)*

#### Source Separation:

Example (multi-instrument):

`THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/multi-instrument/config.json --mixture_input_path audio/`

Example (singing-voice):

`THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/singing-voice/config.json --mixture_input_path audio/`

###### Speedup

To achieve faster source separation, one can increase the target-field length via the optional `--target_field_length` argument. This defines the number of samples that are separated in a single forward propagation, saving redundant calculations (see the sketch after the argument tables below). In the following example, it is increased to roughly 10x the length used during training, and the batch_size is reduced to 4.

Faster Example:

`THEANO_FLAGS=device=cuda python main.py --mode inference --target_field_length 16001 --batch_size 4 --config sessions/multi-instrument/config.json --mixture_input_path audio/`

#### Training:

Example (multi-instrument):

`THEANO_FLAGS=device=cuda python main.py --mode training --target multi-instrument --config config_multi_instrument.json`

Example (singing-voice):

`THEANO_FLAGS=device=cuda python main.py --mode training --target singing-voice --config config_singing_voice.json`

#### Configuration

A detailed description of all configurable parameters can be found in [config.md](https://github.com/francesclluis/source-separation-wavenet/blob/master/config.md)

#### Optional command-line arguments:

Argument | Valid Inputs | Default | Description
-------- | ------------ | ------- | -----------
mode | [training, inference] | training | Whether to train a model or run source separation
target | [multi-instrument, singing-voice] | multi-instrument | Target of the model to train
config | string | config.json | Path to JSON-formatted config file
print_model_summary | bool | False | Prints verbose summary of the model
load_checkpoint | string | None | Path to hdf5 file containing a snapshot of model weights

#### Additional arguments during source separation:

Argument | Valid Inputs | Default | Description
-------- | ------------ | ------- | -----------
one_shot | bool | False | Separates each audio file in a single forward propagation
target_field_length | int | as defined in config.json | Overrides parameter in config.json for separating with different target-field lengths than used in training
batch_size | int | as defined in config.json | Number of samples per batch
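Regarding the `--target_field_length` speedup described above: every forward pass reads `input_length` samples but only emits `target_field_length` separated samples, so a longer target field means less redundant context per output sample. A back-of-the-envelope sketch (the receptive field of 8185 comes from the assumed helper earlier; the `input = receptive + target - 1` relation mirrors models.py):

```python
# Sketch: cost of context per separated sample for two target-field lengths.
receptive_field = 8185  # assumed, see the receptive-field sketch above

for target_field in (1601, 16001):
    input_length = receptive_field + (target_field - 1)
    print('%d -> %.1f input samples read per output sample'
          % (target_field, float(input_length) / target_field))
# 1601  -> ~6.1 input samples read per output sample
# 16001 -> ~1.5 input samples read per output sample
```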
Dataset
-----

The MUSDB18 dataset is used for training the model. It is provided by the community-based Signal Separation Evaluation Campaign (SiSEC).

1. [Download here](https://sigsep.github.io/datasets/musdb.html#download)
2. Decode the dataset to WAV format as explained [here](https://github.com/sigsep/sigsep-mus-io)
3. Extract to `data/MUSDB`

================================================
FILE: config.md
================================================
config.json - Configuring a training session
----

The parameters present in a `config.json` file allow one to configure a training session. Each of these parameters is described below:

### Dataset

How the data is used for training

* **extract_voice_percentage**: (float) Probability of drawing a training fragment that contains singing voice (rather than silence in the vocal stem)
* **in_memory_percentage**: (float) Percentage of the dataset to load into memory, useful when the dataset requires more memory than available
* **path**: (string) Path to dataset
* **sample_rate**: (int) Sample rate to which all samples should be resampled
* **type**: (string) Identifier of which dataset is being used for training

### Model

What the model will be

* **condition_encoding**: (string) Which numerical representation to encode integer condition values to, either binary or one-hot
* **dilations**: (int) Maximum dilation factor as an exponent of 2, e.g. dilations = 9 results in a maximum dilation of 2^9 = 512
* **filters**:
    * **lengths**:
        * **res**: (int) Length of convolution kernels in residual blocks
        * **final**: ([int, int]) Lengths of convolution kernels in final layers, individually definable
        * **skip**: (int) Length of convolution kernels in skip connections
    * **depths**:
        * **res**: (int) Number of filters in residual-block convolution layers
        * **skip**: (int) Number of filters in skip connections
        * **final**: ([int, int]) Number of filters in final layers, individually definable
* **num_stacks**: (int) Number of stacks, as defined in the paper
* **target_field_length**: (int) Length of the output
* **target_padding**: (int) Number of samples used for padding the target_field *per side*

### Training

How training will be carried out

* **batch_size**: (int) Number of samples per batch
* **early_stopping_patience**: (int) Number of epochs to wait without improvement in the monitored loss before stopping training
* **loss**: (in the case of multi-instrument)
    * **out_1**: First term in the three-term loss (vocals)
        * **l1**: (float) Relative weight given to the L1 loss
        * **l2**: (float) Relative weight given to the L2 loss
        * **weight**: (float) Relative weight given to the first term
    * **out_2**: Second term in the three-term loss (drums)
        * **l1**: (float) Relative weight given to the L1 loss
        * **l2**: (float) Relative weight given to the L2 loss
        * **weight**: (float) Relative weight given to the second term
    * **out_3**: Third term in the three-term loss (bass)
        * **l1**: (float) Relative weight given to the L1 loss
        * **l2**: (float) Relative weight given to the L2 loss
        * **weight**: (float) Relative weight given to the third term
* **loss**: (in the case of singing-voice)
    * **out_1**: First term in the two-term loss (singing voice)
        * **l1**: (float) Relative weight given to the L1 loss
        * **l2**: (float) Relative weight given to the L2 loss
        * **weight**: (float) Relative weight given to the first term
    * **out_2**: Second term in the two-term loss (dissimilarity singing voice)
        * **l1**: (float) Relative weight given to the L1 loss
        * **l2**: (float) Relative weight given to the L2 loss
        * **weight**: (float) Relative weight given to the second term
* **num_epochs**: (int) Maximum number of epochs to train for
* **num_steps_test**: (int) Total number of steps (batches of samples) to yield from the validation generator before stopping at the end of every epoch
* **num_steps_train**: (int) Total number of steps (batches of samples) to yield from the training generator before declaring one epoch finished and starting the next epoch
* **path**: (string) Path to the folder containing all files pertaining to the training session
* **verbosity**: (int) Keras verbosity level
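Putting the loss parameters together: models.py scales `util.l1_l2_loss(y_true, y_pred, l1, l2)` by each term's `weight` and lets Keras sum the per-output losses. `util.py` is not included in this dump, so the implementation below is an illustrative reconstruction consistent with how models.py calls it:

```python
import keras.backend as K

# Hypothetical stand-in for util.l1_l2_loss (the real util.py is not shown).
def l1_l2_loss(y_true, y_pred, l1_weight, l2_weight):
    loss = 0
    if l1_weight != 0:
        loss += l1_weight * K.mean(K.abs(y_pred - y_true))
    if l2_weight != 0:
        loss += l2_weight * K.mean(K.square(y_pred - y_true))
    return loss

# For the shipped singing-voice config: out_1 has weight 1 (match the vocals),
# out_2 has weight -0.05 (a small negative weight that rewards dissimilarity
# on the second, "dissimilarity singing voice" output).
```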
================================================
FILE: config_multi_instrument.json
================================================
{
  "dataset": {
    "in_memory_percentage": 1,
    "extract_voice_percentage": 0,
    "path": "data/MUSDB",
    "sample_rate": 16000,
    "type": "musdb18"
  },
  "model": {
    "condition_encoding": "binary",
    "dilations": 9,
    "filters": {
      "lengths": {
        "res": 3,
        "final": [3, 3],
        "skip": 1
      },
      "depths": {
        "res": 64,
        "skip": 64,
        "final": [2048, 256]
      }
    },
    "num_stacks": 4,
    "target_field_length": 1601,
    "target_padding": 1
  },
  "optimizer": {
    "decay": 0.0,
    "epsilon": 1e-08,
    "lr": 0.001,
    "momentum": 0.9,
    "type": "adam"
  },
  "training": {
    "batch_size": 10,
    "early_stopping_patience": 16,
    "loss": {
      "out_1": {"l1": 1, "l2": 0, "weight": 1},
      "out_2": {"l1": 1, "l2": 0, "weight": 1},
      "out_3": {"l1": 1, "l2": 0, "weight": 1}
    },
    "num_epochs": 250,
    "num_steps_test": 500,
    "num_steps_train": 2000,
    "path": "sessions/003",
    "verbosity": 1
  }
}

================================================
FILE: config_singing_voice.json
================================================
{
  "dataset": {
    "in_memory_percentage": 1,
    "extract_voice_percentage": 0.5,
    "path": "data/MUSDB",
    "sample_rate": 16000,
    "type": "musdb18"
  },
  "model": {
    "condition_encoding": "binary",
    "dilations": 9,
    "filters": {
      "lengths": {
        "res": 3,
        "final": [3, 3],
        "skip": 1
      },
      "depths": {
        "res": 64,
        "skip": 64,
        "final": [2048, 256]
      }
    },
    "num_stacks": 4,
    "target_field_length": 1601,
    "target_padding": 1
  },
  "optimizer": {
    "decay": 0.0,
    "epsilon": 1e-08,
    "lr": 0.001,
    "momentum": 0.9,
    "type": "adam"
  },
  "training": {
    "batch_size": 10,
    "early_stopping_patience": 16,
    "loss": {
      "out_1": {"l1": 1, "l2": 0, "weight": 1},
      "out_2": {"l1": 1, "l2": 0, "weight": -0.05}
    },
    "num_epochs": 250,
    "num_steps_test": 500,
    "num_steps_train": 2000,
    "path": "sessions/002",
    "verbosity": 1
  }
}
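The next file, datasets.py, builds its fixed train/validation split on top of the musdb package (pinned at 0.2.3 in environment.yml). A minimal sketch of that loading pattern, assuming the dataset has already been decoded to WAV as described in the README:

```python
import musdb
import numpy as np

# Minimal sketch of the loading pattern used by datasets.py (musdb 0.2.3 API).
mus = musdb.DB(root_dir='data/MUSDB', is_wav=True)
tracks = mus.load_mus_tracks(subsets='train')

# Reproduce the fixed 25-track validation split from load_dataset().
np.random.seed(seed=1337)
val_idx = np.random.choice(len(tracks), size=25, replace=False)
print('%d tracks total, %d held out for validation' % (len(tracks), len(val_idx)))
```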
================================================
FILE: datasets.py
================================================
# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018
# Datasets.py

import util
import os
import numpy as np
import musdb
import logging


class SingingVoiceMUSDB18Dataset():

    def __init__(self, config, model):
        self.model = model
        self.path = config['dataset']['path']
        self.sample_rate = config['dataset']['sample_rate']
        self.file_paths = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []},
                           'val': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}
        self.sequences = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []},
                          'val': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}
        self.voice_indices = {'train': [], 'val': []}
        self.batch_size = config['training']['batch_size']
        self.extract_voice_percent = config['dataset']['extract_voice_percentage']
        self.in_memory_percentage = config['dataset']['in_memory_percentage']
        self.num_sequences_in_memory = 0
        self.condition_encode_function = util.get_condition_input_encode_func(config['model']['condition_encoding'])

    def load_dataset(self):
        print('Loading MUSDB18 dataset for singing voice separation...')
        mus = musdb.DB(root_dir=self.path, is_wav=True)
        tracks = mus.load_mus_tracks(subsets='train')
        np.random.seed(seed=1337)
        val_idx = np.random.choice(len(tracks), size=25, replace=False)
        train_idx = [i for i in range(len(tracks)) if i not in val_idx]
        val_tracks = [tracks[i] for i in val_idx]
        train_tracks = [tracks[i] for i in train_idx]
        for condition in ['mixture', 'vocals']:
            self.file_paths['val'][condition] = [track.path[:-11] + condition + '.wav' for track in val_tracks]
        for condition in ['mixture', 'vocals']:
            self.file_paths['train'][condition] = [track.path[:-11] + condition + '.wav' for track in train_tracks]
        self.load_songs()
        return self

    def load_songs(self):
        for set in ['train', 'val']:
            for condition in ['mixture', 'vocals']:
                for filepath in self.file_paths[set][condition]:
                    if condition == 'vocals':
                        sequence = util.load_wav(filepath, self.sample_rate)
                        self.sequences[set][condition].append(sequence)
                        self.num_sequences_in_memory += 1
                        if self.extract_voice_percent > 0:
                            self.voice_indices[set].append(util.get_sequence_with_singing_indices(sequence))
                    else:
                        if self.in_memory_percentage == 1 or np.random.uniform(0, 1) <= (
                                self.in_memory_percentage - 0.5) * 2:
                            sequence = util.load_wav(filepath, self.sample_rate)
                            self.sequences[set][condition].append(sequence)
                            self.num_sequences_in_memory += 1
                        else:
                            self.sequences[set][condition].append([-1])

    def get_num_sequences_in_dataset(self):
        return len(self.sequences['train']['vocals']) + len(self.sequences['train']['mixture']) + len(
            self.sequences['val']['vocals']) + len(self.sequences['val']['mixture'])

    def retrieve_sequence(self, set, condition, sequence_num):
        if len(self.sequences[set][condition][sequence_num]) == 1:
            sequence = util.load_wav(self.file_paths[set][condition][sequence_num], self.sample_rate)
            if (float(self.num_sequences_in_memory) / self.get_num_sequences_in_dataset()) < self.in_memory_percentage:
                self.sequences[set][condition][sequence_num] = sequence
                self.num_sequences_in_memory += 1
        else:
            sequence = self.sequences[set][condition][sequence_num]
        return np.array(sequence)

    def get_random_batch_generator(self, set):
        if set not in ['train', 'val']:
            raise ValueError("Argument SET must be either 'train' or 'val'")
        while True:
            sample_indices = np.random.randint(0, len(self.sequences[set]['vocals']), self.batch_size)
            batch_inputs = []
            batch_outputs_1 = []
            batch_outputs_2 = []
            for i, sample_i in enumerate(sample_indices):
                while True:
                    starting_index = 0
                    mixture = self.retrieve_sequence(set, 'mixture', sample_i)
                    vocals = self.retrieve_sequence(set, 'vocals', sample_i)
                    accompaniment = mixture - vocals
                    if np.random.uniform(0, 1) < self.extract_voice_percent:
                        indices = self.voice_indices[set][sample_i]
                        vocals_indices, _ = util.get_indices_subsequence(indices)
                        vocals = vocals[vocals_indices[0]:vocals_indices[1]]
                        starting_index = vocals_indices[0]
                    if len(vocals) < self.model.input_length:
                        sample_i = np.random.randint(0, len(self.sequences[set]['vocals']))
                    else:
                        break
                offset_1 = np.squeeze(np.random.randint(0, len(vocals) - self.model.input_length + 1, 1))
                vocals_fragment = vocals[offset_1:offset_1 + self.model.input_length]
                offset_2 = offset_1 + starting_index
                accompaniment_fragment = accompaniment[offset_2:offset_2 + self.model.input_length]
                input = accompaniment_fragment + vocals_fragment
                output_vocals = vocals_fragment
                output_accompaniment = accompaniment_fragment
                batch_inputs.append(input)
                batch_outputs_1.append(output_vocals)
                batch_outputs_2.append(output_accompaniment)
            batch_inputs = np.array(batch_inputs, dtype='float32')
            batch_outputs_1 = np.array(batch_outputs_1, dtype='float32')
            batch_outputs_2 = np.array(batch_outputs_2, dtype='float32')
            batch_outputs_1 = batch_outputs_1[:, self.model.get_padded_target_field_indices()]
            batch_outputs_2 = batch_outputs_2[:, self.model.get_padded_target_field_indices()]
            batch = {'data_input': batch_inputs}, {'data_output_1': batch_outputs_1,
                                                   'data_output_2': batch_outputs_2}
            yield batch

    def get_condition_input_encode_func(self, representation):
        if representation == 'binary':
            return util.binary_encode
        else:
            return util.one_hot_encode

    def get_target_sample_index(self):
        return int(np.floor(self.fragment_length / 2.0))

    def get_samples_of_interest_indices(self, causal=False):
        if causal:
            return -1
        else:
            target_sample_index = self.get_target_sample_index()
            return range(target_sample_index - self.half_target_field_length - self.target_padding,
                         target_sample_index + self.half_target_field_length + self.target_padding + 1)

    def get_sample_weight_vector_length(self):
        if self.samples_of_interest_only:
            return len(self.get_samples_of_interest_indices())
        else:
            return self.fragment_length


class MultiInstrumentMUSDB18Dataset():

    def __init__(self, config, model):
        self.model = model
        self.path = config['dataset']['path']
        self.sample_rate = config['dataset']['sample_rate']
        self.file_paths = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []},
                           'val': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}
        self.sequences = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []},
                          'val': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}
        self.voice_indices = {'train': [], 'val': []}
        self.batch_size = config['training']['batch_size']
        self.extract_voice_percent = config['dataset']['extract_voice_percentage']
        self.in_memory_percentage = config['dataset']['in_memory_percentage']
        self.num_sequences_in_memory = 0
        self.condition_encode_function = util.get_condition_input_encode_func(config['model']['condition_encoding'])

    def load_dataset(self):
        print('Loading MUSDB18 dataset for multi-instrument separation...')
        mus = musdb.DB(root_dir=self.path, is_wav=True)
        tracks = mus.load_mus_tracks(subsets='train')
        np.random.seed(seed=1337)
        val_idx = np.random.choice(len(tracks), size=25, replace=False)
        train_idx = [i for i in range(len(tracks)) if i not in val_idx]
        val_tracks = [tracks[i] for i in val_idx]
        train_tracks = [tracks[i] for i in train_idx]
        for condition in ['mixture', 'vocals', 'drums', 'other', 'bass']:
            self.file_paths['val'][condition] = [track.path[:-11] + condition + '.wav' for track in val_tracks]
        for condition in ['mixture', 'vocals', 'drums', 'other', 'bass']:
            self.file_paths['train'][condition] = [track.path[:-11] + condition + '.wav' for track in train_tracks]
        self.load_songs()
        return self

    def load_songs(self):
        for set in ['train', 'val']:
            for condition in ['vocals', 'mixture', 'drums', 'other', 'bass']:
                for filepath in self.file_paths[set][condition]:
                    if condition == 'vocals':
                        sequence = util.load_wav(filepath, self.sample_rate)
                        self.sequences[set][condition].append(sequence)
                        self.num_sequences_in_memory += 1
                        if self.extract_voice_percent > 0:
                            self.voice_indices[set].append(util.get_sequence_with_singing_indices(sequence))
                    else:
                        if self.in_memory_percentage == 1 or np.random.uniform(0, 1) <= (
                                self.in_memory_percentage - 0.5) * 2:
                            sequence = util.load_wav(filepath, self.sample_rate)
                            self.sequences[set][condition].append(sequence)
                            self.num_sequences_in_memory += 1
                        else:
                            self.sequences[set][condition].append([-1])

    def get_num_sequences_in_dataset(self):
        return len(self.sequences['train']['vocals']) + len(self.sequences['train']['mixture']) + len(
            self.sequences['val']['vocals']) + len(self.sequences['val']['mixture'])

    def retrieve_sequence(self, set, condition, sequence_num):
        if len(self.sequences[set][condition][sequence_num]) == 1:
            sequence = util.load_wav(self.file_paths[set][condition][sequence_num], self.sample_rate)
            if (float(self.num_sequences_in_memory) / self.get_num_sequences_in_dataset()) < self.in_memory_percentage:
                self.sequences[set][condition][sequence_num] = sequence
                self.num_sequences_in_memory += 1
        else:
            sequence = self.sequences[set][condition][sequence_num]
        return np.array(sequence)

    def get_random_batch_generator(self, set):
        if set not in ['train', 'val']:
            raise ValueError("Argument SET must be either 'train' or 'val'")
        while True:
            sample_indices = np.random.randint(0, len(self.sequences[set]['vocals']), self.batch_size)
            batch_inputs = []
            batch_outputs_1 = []
            batch_outputs_2 = []
            batch_outputs_3 = []
            for i, sample_i in enumerate(sample_indices):
                while True:
                    starting_index = 0
                    vocals = self.retrieve_sequence(set, 'vocals', sample_i)
                    bass = self.retrieve_sequence(set, 'bass', sample_i)
                    drums = self.retrieve_sequence(set, 'drums', sample_i)
                    other = self.retrieve_sequence(set, 'other', sample_i)
                    if np.random.uniform(0, 1) < self.extract_voice_percent:
                        indices = self.voice_indices[set][sample_i]
                        vocals_indices, _ = util.get_indices_subsequence(indices)
                        vocals = vocals[vocals_indices[0]:vocals_indices[1]]
                        starting_index = vocals_indices[0]
                    if len(vocals) < self.model.input_length:
                        sample_i = np.random.randint(0, len(self.sequences[set]['vocals']))
                    else:
                        break
                offset_1 = np.squeeze(np.random.randint(0, len(vocals) - self.model.input_length + 1, 1))
                vocals_fragment = vocals[offset_1:offset_1 + self.model.input_length]
                offset_2 = offset_1 + starting_index
                bass_fragment = bass[offset_2:offset_2 + self.model.input_length]
                drums_fragment = drums[offset_2:offset_2 + self.model.input_length]
                other_fragment = other[offset_2:offset_2 + self.model.input_length]
                input = vocals_fragment + bass_fragment + drums_fragment + other_fragment
                output_vocals = vocals_fragment
                output_drums = drums_fragment
                output_bass = bass_fragment
                batch_inputs.append(input)
                batch_outputs_1.append(output_vocals)
                batch_outputs_2.append(output_drums)
                batch_outputs_3.append(output_bass)
            batch_inputs = np.array(batch_inputs, dtype='float32')
            batch_outputs_1 = np.array(batch_outputs_1, dtype='float32')
            batch_outputs_2 = np.array(batch_outputs_2, dtype='float32')
            batch_outputs_3 = np.array(batch_outputs_3, dtype='float32')
            batch_outputs_1 = batch_outputs_1[:, self.model.get_padded_target_field_indices()]
            batch_outputs_2 = batch_outputs_2[:, self.model.get_padded_target_field_indices()]
            batch_outputs_3 = batch_outputs_3[:, self.model.get_padded_target_field_indices()]
            batch = {'data_input': batch_inputs}, {'data_output_1': batch_outputs_1,
                                                   'data_output_2': batch_outputs_2,
                                                   'data_output_3': batch_outputs_3}
            yield batch

    def get_condition_input_encode_func(self, representation):
        if representation == 'binary':
            return util.binary_encode
        else:
            return util.one_hot_encode

    def get_target_sample_index(self):
        return int(np.floor(self.fragment_length / 2.0))

    def get_samples_of_interest_indices(self, causal=False):
        if causal:
            return -1
        else:
            target_sample_index = self.get_target_sample_index()
            return range(target_sample_index - self.half_target_field_length - self.target_padding,
                         target_sample_index + self.half_target_field_length + self.target_padding + 1)

    def get_sample_weight_vector_length(self):
        if self.samples_of_interest_only:
            return len(self.get_samples_of_interest_indices())
        else:
            return self.fragment_length

================================================
FILE: environment.yml
================================================
name: sswavenet
channels:
  - anaconda
  - conda-forge
  - defaults
dependencies:
  - intel-openmp=2018.0.0=hc7b2577_8
  - mkl=2018.0.1=h19d6760_4
  - mkl-service=1.1.2=py27hb2d42c5_4
  - ca-certificates=2018.1.18=0
  - certifi=2018.1.18=py27_0
  - h5py=2.7.1=py27_2
  - hdf5=1.10.1=2
  - keras=2.1.5=py27_0
  - libgpuarray=0.7.5=0
  - mako=1.0.7=py27_0
  - markupsafe=1.0=py27_0
  - openssl=1.0.2n=0
  - pygpu=0.7.5=py27_0
  - pyyaml=3.12=py27_1
  - six=1.11.0=py27_1
  - theano=1.0.1=py27_1
  - yaml=0.1.7=0
  - libedit=3.1=heed3624_0
  - libffi=3.2.1=hd88cf55_4
  - libgcc-ng=7.2.0=hdf63c60_3
  - libgfortran=3.0.0=1
  - libgfortran-ng=7.2.0=hdf63c60_3
  - libstdcxx-ng=7.2.0=hdf63c60_3
  - ncurses=6.0=h9df7e31_2
  - numpy=1.14.2=py27hdbf6ddf_0
  - pip=9.0.1=py27_5
  - python=2.7.14=h1571d57_30
  - readline=7.0=ha6073c6_4
  - scipy=1.0.0=py27hf5f0f52_0
  - setuptools=38.5.1=py27_0
  - sqlite=3.22.0=h1bed415_0
  - tk=8.6.7=hc745277_3
  - wheel=0.30.0=py27h2bc6bb2_1
  - zlib=1.2.11=ha838bed_2
  - pip:
    - cffi==1.11.5
    - functools32==3.2.3.post2
    - jsonschema==2.6.0
    - musdb==0.2.3
    - museval==0.2.0
    - pyaml==17.12.1
    - pycparser==2.18
    - simplejson==3.13.2
    - soundfile==0.9.0
    - stempeg==0.1.3
    - tqdm==4.19.7

================================================
FILE: layers.py
================================================
# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018
# Layers.py

import keras


class AddSingletonDepth(keras.layers.Layer):

    def call(self, x, mask=None):
        x = keras.backend.expand_dims(x, -1)  # add a dimension on the right
        if keras.backend.ndim(x) == 4:
            return keras.backend.permute_dimensions(x, (0, 3, 1, 2))
        else:
            return x

    def compute_output_shape(self, input_shape):
        if len(input_shape) == 3:
            return input_shape[0], 1, input_shape[1], input_shape[2]
        else:
            return input_shape[0], input_shape[1], 1


class Subtract(keras.layers.Layer):

    def __init__(self, **kwargs):
        super(Subtract, self).__init__(**kwargs)

    def call(self, x, mask=None):
        return x[0] - x[1]

    def compute_output_shape(self, input_shape):
        return input_shape[0]


class Add(keras.layers.Layer):

    def __init__(self, **kwargs):
        super(Add, self).__init__(**kwargs)

    def call(self, x, mask=None):
        output = x[0]
        for i in range(1, len(x)):
            output += x[i]
        return output

    def compute_output_shape(self, input_shape):
        return input_shape[0]


class Slice(keras.layers.Layer):

    def __init__(self, selector, output_shape, **kwargs):
        self.selector = selector
        self.desired_output_shape = output_shape
        super(Slice, self).__init__(**kwargs)

    def call(self, x, mask=None):
        selector = self.selector
        if len(self.selector) == 2 and not type(self.selector[1]) is slice and not type(self.selector[1]) is int:
            x = keras.backend.permute_dimensions(x, [0, 2, 1])
            selector = (self.selector[1], self.selector[0])
        y = x[selector]
        if len(self.selector) == 2 and not type(self.selector[1]) is slice and not type(self.selector[1]) is int:
            y = keras.backend.permute_dimensions(y, [0, 2, 1])
        return y

    def compute_output_shape(self, input_shape):
        output_shape = (None,)
        for i, dim_length in enumerate(self.desired_output_shape):
            if dim_length == Ellipsis:
                output_shape = output_shape + (input_shape[i + 1],)
            else:
                output_shape = output_shape + (dim_length,)
        return output_shape
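The `Slice` layer above mimics basic numpy-style indexing over the (time, channels) axes of a `(batch, time, channels)` tensor; when the second selector element is `Ellipsis`, the permute branch reroutes the slice to the time axis. An illustration with numpy standing in for a backend tensor:

```python
import numpy as np

# Illustration only: what layers.Slice computes on a (batch, time, channels) array.
x = np.arange(2 * 8 * 4).reshape(2, 8, 4)

# Selector (Ellipsis, slice(0, 2)) keeps the first two channels at every time
# step, like the residual/skip channel splits in models.py.
print(x[Ellipsis, slice(0, 2)].shape)  # -> (2, 8, 2)

# Selector (slice(2, 6, 1), Ellipsis) goes through the permute branch and
# crops the *time* axis, like the keep_samples_of_interest slices.
print(x[:, 2:6, :].shape)              # -> (2, 4, 4)
```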
================================================
FILE: main.py
================================================
# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018
# Main.py

import sys
import logging
import optparse
import json
import os

import models
import datasets
import util
import separate


def set_system_settings():
    sys.setrecursionlimit(50000)
    logging.getLogger().setLevel(logging.INFO)


def get_command_line_arguments():
    parser = optparse.OptionParser()
    parser.set_defaults(config='sessions/multi-instrument/config.json')
    parser.set_defaults(mode='training')
    parser.set_defaults(target='multi-instrument')
    parser.set_defaults(load_checkpoint=None)
    parser.set_defaults(condition_value=0)
    parser.set_defaults(batch_size=None)
    parser.set_defaults(one_shot=False)
    parser.set_defaults(mixture_input_path=None)
    parser.set_defaults(print_model_summary=False)
    parser.set_defaults(target_field_length=None)

    parser.add_option('--mode', dest='mode')
    parser.add_option('--target', dest='target')
    parser.add_option('--print_model_summary', dest='print_model_summary')
    parser.add_option('--config', dest='config')
    parser.add_option('--load_checkpoint', dest='load_checkpoint')
    parser.add_option('--condition_value', dest='condition_value')
    parser.add_option('--batch_size', dest='batch_size')
    parser.add_option('--one_shot', dest='one_shot')
    parser.add_option('--mixture_input_path', dest='mixture_input_path')
    parser.add_option('--target_field_length', dest='target_field_length')

    (options, args) = parser.parse_args()
    return options


def load_config(config_filepath):
    try:
        config_file = open(config_filepath, 'r')
    except IOError:
        logging.error('No readable config file at path: ' + config_filepath)
        exit()
    else:
        with config_file:
            return json.load(config_file)


def get_dataset(config, cla, model):
    if config['dataset']['type'] == 'musdb18':
        if cla.target == 'singing-voice':
            return datasets.SingingVoiceMUSDB18Dataset(config, model).load_dataset()
        elif cla.target == 'multi-instrument':
            return datasets.MultiInstrumentMUSDB18Dataset(config, model).load_dataset()


def training(config, cla):
    # Instantiate Model
    if cla.target == 'singing-voice':
        model = models.SingingVoiceSeparationWavenet(config, load_checkpoint=cla.load_checkpoint,
                                                     print_model_summary=cla.print_model_summary)
    elif cla.target == 'multi-instrument':
        model = models.MultiInstrumentSeparationWavenet(config, load_checkpoint=cla.load_checkpoint,
                                                        print_model_summary=cla.print_model_summary)
    else:
        raise Exception("Argument target must be either 'singing-voice' or 'multi-instrument'")

    dataset = get_dataset(config, cla, model)

    num_steps_train = config['training']['num_steps_train']
    num_steps_val = config['training']['num_steps_test']

    train_set_generator = dataset.get_random_batch_generator('train')
    val_set_generator = dataset.get_random_batch_generator('val')

    model.fit_model(train_set_generator, num_steps_train, val_set_generator, num_steps_val,
                    config['training']['num_epochs'])


def get_valid_output_folder_path(outputs_folder_path):
    j = 1
    while True:
        output_folder_name = 'samples_%d' % j
        output_folder_path = os.path.join(outputs_folder_path, output_folder_name)
        if not os.path.isdir(output_folder_path):
            os.mkdir(output_folder_path)
            break
        j += 1
    return output_folder_path


def inference(config, cla):
    if cla.batch_size is not None:
        batch_size = int(cla.batch_size)
    else:
        batch_size = config['training']['batch_size']

    if cla.target_field_length is not None:
        cla.target_field_length = int(cla.target_field_length)

    if not bool(cla.one_shot):
        if config['model']['type'] == 'singing-voice':
            model = models.SingingVoiceSeparationWavenet(config, target_field_length=cla.target_field_length,
                                                         load_checkpoint=cla.load_checkpoint,
                                                         print_model_summary=cla.print_model_summary)
        elif config['model']['type'] == 'multi-instrument':
            model = models.MultiInstrumentSeparationWavenet(config, target_field_length=cla.target_field_length,
                                                            load_checkpoint=cla.load_checkpoint,
                                                            print_model_summary=cla.print_model_summary)
        print 'Performing inference..'
    else:
        print 'Performing one-shot inference..'

    samples_folder_path = os.path.join(config['training']['path'], 'samples')
    output_folder_path = get_valid_output_folder_path(samples_folder_path)

    # If input_path is a single wav file, then set filenames to single element with wav filename
    if cla.mixture_input_path.endswith('.wav'):
        filenames = [cla.mixture_input_path.rsplit('/', 1)[-1]]
        cla.mixture_input_path = cla.mixture_input_path.rsplit('/', 1)[0] + '/'
    else:
        if not cla.mixture_input_path.endswith('/'):
            cla.mixture_input_path += '/'
        filenames = [filename for filename in os.listdir(cla.mixture_input_path) if filename.endswith('.wav')]

    for filename in filenames:
        mixture_input = util.load_wav(cla.mixture_input_path + filename, config['dataset']['sample_rate'])
        input = {'mixture': mixture_input}
        output_filename_prefix = filename[0:-4]
        if bool(cla.one_shot):
            if len(input['mixture']) % 2 == 0:  # If input length is even, remove one sample
                input['mixture'] = input['mixture'][:-1]
            if config['model']['type'] == 'singing-voice':
                model = models.SingingVoiceSeparationWavenet(config, target_field_length=cla.target_field_length,
                                                             load_checkpoint=cla.load_checkpoint,
                                                             print_model_summary=cla.print_model_summary)
            elif config['model']['type'] == 'multi-instrument':
                model = models.MultiInstrumentSeparationWavenet(config, target_field_length=cla.target_field_length,
                                                                load_checkpoint=cla.load_checkpoint,
                                                                print_model_summary=cla.print_model_summary)
        print "Separating: " + filename
        separate.separate_sample(model, input, batch_size, output_filename_prefix,
                                 config['dataset']['sample_rate'], output_folder_path,
                                 config['model']['type'])


def main():
    set_system_settings()
    cla = get_command_line_arguments()
    config = load_config(cla.config)
    if cla.mode == 'training':
        training(config, cla)
    elif cla.mode == 'inference':
        inference(config, cla)


if __name__ == "__main__":
    main()
================================================
FILE: models.py
================================================
# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018
# Models.py

import keras
import util
import os
import numpy as np
import layers
import logging


# Singing Voice Separation Wavenet Model
class SingingVoiceSeparationWavenet():

    def __init__(self, config, load_checkpoint=None, input_length=None, target_field_length=None,
                 print_model_summary=False):
        self.config = config
        self.verbosity = config['training']['verbosity']
        self.num_stacks = self.config['model']['num_stacks']
        if type(self.config['model']['dilations']) is int:
            self.dilations = [2 ** i for i in range(0, self.config['model']['dilations'] + 1)]
        elif type(self.config['model']['dilations']) is list:
            self.dilations = self.config['model']['dilations']
        self.receptive_field_length = util.compute_receptive_field_length(
            config['model']['num_stacks'], self.dilations, config['model']['filters']['lengths']['res'], 1)
        if input_length is not None:
            self.input_length = input_length
            self.target_field_length = self.input_length - (self.receptive_field_length - 1)
        if target_field_length is not None:
            self.target_field_length = target_field_length
            self.input_length = self.receptive_field_length + (self.target_field_length - 1)
        else:
            self.target_field_length = config['model']['target_field_length']
            self.input_length = self.receptive_field_length + (self.target_field_length - 1)
        self.target_padding = config['model']['target_padding']
        self.padded_target_field_length = self.target_field_length + 2 * self.target_padding
        self.half_target_field_length = self.target_field_length / 2
        self.half_receptive_field_length = self.receptive_field_length / 2
        self.num_residual_blocks = len(self.dilations) * self.num_stacks
        self.activation = keras.layers.Activation('relu')
        self.samples_of_interest_indices = self.get_padded_target_field_indices()
        self.target_sample_indices = self.get_target_field_indices()
        self.optimizer = self.get_optimizer()
        self.out_1_loss = self.get_out_1_loss()
        self.out_2_loss = self.get_out_2_loss()
        self.metrics = self.get_metrics()
        self.epoch_num = 0
        self.checkpoints_path = ''
        self.samples_path = ''
        self.history_filename = ''
        self.config['model']['num_residual_blocks'] = self.num_residual_blocks
        self.config['model']['receptive_field_length'] = self.receptive_field_length
        self.config['model']['input_length'] = self.input_length
        self.config['model']['target_field_length'] = self.target_field_length
        self.config['model']['type'] = 'singing-voice'
        self.model = self.setup_model(load_checkpoint, print_model_summary)

    def setup_model(self, load_checkpoint=None, print_model_summary=False):
        self.checkpoints_path = os.path.join(self.config['training']['path'], 'checkpoints')
        self.samples_path = os.path.join(self.config['training']['path'], 'samples')
        self.history_filename = 'history_' + self.config['training']['path'][
            self.config['training']['path'].rindex('/') + 1:] + '.csv'
        model = self.build_model()
        if os.path.exists(self.checkpoints_path) and util.dir_contains_files(self.checkpoints_path):
            if load_checkpoint is not None:
                last_checkpoint_path = load_checkpoint
                self.epoch_num = 0
            else:
                checkpoints = os.listdir(self.checkpoints_path)
                checkpoints.sort(key=lambda x: os.stat(os.path.join(self.checkpoints_path, x)).st_mtime)
                last_checkpoint = checkpoints[-1]
                last_checkpoint_path = os.path.join(self.checkpoints_path, last_checkpoint)
                self.epoch_num = int(last_checkpoint[11:16])
            print 'Loading model from epoch: %d' % self.epoch_num
            model.load_weights(last_checkpoint_path)
        else:
            print 'Building new model...'
            if not os.path.exists(self.config['training']['path']):
                os.mkdir(self.config['training']['path'])
            if not os.path.exists(self.checkpoints_path):
                os.mkdir(self.checkpoints_path)
            self.epoch_num = 0
        if not os.path.exists(self.samples_path):
            os.mkdir(self.samples_path)
        if print_model_summary:
            model.summary()
        model.compile(optimizer=self.optimizer,
                      loss={'data_output_1': self.out_1_loss, 'data_output_2': self.out_2_loss},
                      metrics=self.metrics)
        self.config['model']['num_params'] = model.count_params()
        config_path = os.path.join(self.config['training']['path'], 'config.json')
        if not os.path.exists(config_path):
            util.pretty_json_dump(self.config, config_path)
        if print_model_summary:
            util.pretty_json_dump(self.config)
        return model

    def get_optimizer(self):
        return keras.optimizers.Adam(lr=self.config['optimizer']['lr'],
                                     decay=self.config['optimizer']['decay'],
                                     epsilon=self.config['optimizer']['epsilon'])

    def get_out_1_loss(self):
        if self.config['training']['loss']['out_1']['weight'] == 0:
            return lambda y_true, y_pred: y_true * 0
        return lambda y_true, y_pred: self.config['training']['loss']['out_1']['weight'] * util.l1_l2_loss(
            y_true, y_pred, self.config['training']['loss']['out_1']['l1'],
            self.config['training']['loss']['out_1']['l2'])

    def get_out_2_loss(self):
        if self.config['training']['loss']['out_2']['weight'] == 0:
            return lambda y_true, y_pred: y_true * 0
        return lambda y_true, y_pred: self.config['training']['loss']['out_2']['weight'] * util.l1_l2_loss(
            y_true, y_pred, self.config['training']['loss']['out_2']['l1'],
            self.config['training']['loss']['out_2']['l2'])

    def get_callbacks(self):
        return [
            keras.callbacks.EarlyStopping(patience=self.config['training']['early_stopping_patience'],
                                          verbose=1, monitor='loss'),
            keras.callbacks.ModelCheckpoint(os.path.join(self.checkpoints_path,
                                                         'checkpoint.{epoch:05d}-{val_loss:.3f}.hdf5')),
            keras.callbacks.CSVLogger(os.path.join(self.config['training']['path'], self.history_filename),
                                      append=True)
        ]

    def fit_model(self, train_set_generator, num_steps_train, test_set_generator, num_steps_test, num_epochs):
        print('Fitting model with %d training num steps and %d test num steps...'
              % (num_steps_train, num_steps_test))
        self.model.fit_generator(train_set_generator, num_steps_train, epochs=num_epochs,
                                 validation_data=test_set_generator, validation_steps=num_steps_test,
                                 callbacks=self.get_callbacks(), verbose=self.verbosity,
                                 initial_epoch=self.epoch_num)

    def separate_batch(self, inputs):
        return self.model.predict_on_batch(inputs)

    def get_target_field_indices(self):
        target_sample_index = self.get_target_sample_index()
        return range(target_sample_index - self.half_target_field_length,
                     target_sample_index + self.half_target_field_length + 1)

    def get_padded_target_field_indices(self):
        target_sample_index = self.get_target_sample_index()
        return range(target_sample_index - self.half_target_field_length - self.target_padding,
                     target_sample_index + self.half_target_field_length + self.target_padding + 1)

    def get_target_sample_index(self):
        return int(np.floor(self.input_length / 2.0))

    def get_metrics(self):
        return [
            keras.metrics.mean_absolute_error,
            self.valid_mean_absolute_error
        ]

    def valid_mean_absolute_error(self, y_true, y_pred):
        return keras.backend.mean(
            keras.backend.abs(y_true[:, 1:-2] - y_pred[:, 1:-2]))

    def build_model(self):
        data_input = keras.engine.Input(
            shape=(self.input_length,), name='data_input')

        data_expanded = layers.AddSingletonDepth()(data_input)

        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['res'],
                                              self.config['model']['filters']['lengths']['res'],
                                              padding='same', use_bias=False,
                                              name='initial_causal_conv')(data_expanded)

        skip_connections = []
        res_block_i = 0
        for stack_i in range(self.num_stacks):
            layer_in_stack = 0
            for dilation in self.dilations:
                res_block_i += 1
                data_out, skip_out = self.dilated_residual_block(data_out, res_block_i, layer_in_stack,
                                                                 dilation, stack_i)
                if skip_out is not None:
                    skip_connections.append(skip_out)
                layer_in_stack += 1

        data_out = keras.layers.Add()(skip_connections)
        data_out = self.activation(data_out)

        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][0],
                                              self.config['model']['filters']['lengths']['final'][0],
                                              padding='same', use_bias=False)(data_out)
        data_out = self.activation(data_out)

        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][1],
                                              self.config['model']['filters']['lengths']['final'][1],
                                              padding='same', use_bias=False)(data_out)

        data_out = keras.layers.Convolution1D(1, 1)(data_out)

        data_out_vocals_1 = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),
                                                output_shape=lambda shape: (shape[0], shape[1]),
                                                name='data_output_1')(data_out)
        data_out_vocals_2 = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),
                                                output_shape=lambda shape: (shape[0], shape[1]),
                                                name='data_output_2')(data_out)

        return keras.engine.Model(inputs=[data_input], outputs=[data_out_vocals_1, data_out_vocals_2])

    def dilated_residual_block(self, data_x, res_block_i, layer_i, dilation, stack_i):
        original_x = data_x

        data_out = keras.layers.Conv1D(2 * self.config['model']['filters']['depths']['res'],
                                       self.config['model']['filters']['lengths']['res'],
                                       dilation_rate=dilation, padding='same', use_bias=False,
                                       name='res_%d_dilated_conv_d%d_s%d' % (res_block_i, dilation, stack_i),
                                       activation=None)(data_x)

        data_out_1 = layers.Slice(
            (Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),
            (self.input_length, self.config['model']['filters']['depths']['res']),
            name='res_%d_data_slice_1_d%d_s%d' % (self.num_residual_blocks, dilation, stack_i))(data_out)

        data_out_2 = layers.Slice(
            (Ellipsis, slice(self.config['model']['filters']['depths']['res'],
                             2 * self.config['model']['filters']['depths']['res'])),
            (self.input_length, self.config['model']['filters']['depths']['res']),
            name='res_%d_data_slice_2_d%d_s%d' % (self.num_residual_blocks, dilation, stack_i))(data_out)

        tanh_out = keras.layers.Activation('tanh')(data_out_1)
        sigm_out = keras.layers.Activation('sigmoid')(data_out_2)

        data_x = keras.layers.Multiply(name='res_%d_gated_activation_%d_s%d' % (res_block_i, layer_i, stack_i))(
            [tanh_out, sigm_out])

        data_x = keras.layers.Convolution1D(
            self.config['model']['filters']['depths']['res'] + self.config['model']['filters']['depths']['skip'],
            1, padding='same', use_bias=False)(data_x)

        res_x = layers.Slice((Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),
                             (self.input_length, self.config['model']['filters']['depths']['res']),
                             name='res_%d_data_slice_3_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)

        skip_x = layers.Slice((Ellipsis, slice(self.config['model']['filters']['depths']['res'],
                                               self.config['model']['filters']['depths']['res'] +
                                               self.config['model']['filters']['depths']['skip'])),
                              (self.input_length, self.config['model']['filters']['depths']['skip']),
                              name='res_%d_data_slice_4_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)

        skip_x = layers.Slice((slice(self.samples_of_interest_indices[0],
                                     self.samples_of_interest_indices[-1] + 1, 1), Ellipsis),
                              (self.padded_target_field_length,
                               self.config['model']['filters']['depths']['skip']),
                              name='res_%d_keep_samples_of_interest_d%d_s%d' % (res_block_i, dilation, stack_i))(skip_x)

        res_x = keras.layers.Add()([original_x, res_x])

        return res_x, skip_x


# Multi-Instrument Separation Wavenet Model
class MultiInstrumentSeparationWavenet():

    def __init__(self, config, load_checkpoint=None, input_length=None, target_field_length=None,
                 print_model_summary=False):
        self.config = config
        self.verbosity = config['training']['verbosity']
        self.num_stacks = self.config['model']['num_stacks']
        if type(self.config['model']['dilations']) is int:
            self.dilations = [2 ** i for i in range(0, self.config['model']['dilations'] + 1)]
        elif type(self.config['model']['dilations']) is list:
            self.dilations = self.config['model']['dilations']
        self.receptive_field_length = util.compute_receptive_field_length(
            config['model']['num_stacks'], self.dilations, config['model']['filters']['lengths']['res'], 1)
        if input_length is not None:
            self.input_length = input_length
            self.target_field_length = self.input_length - (self.receptive_field_length - 1)
        if target_field_length is not None:
            self.target_field_length = target_field_length
            self.input_length = self.receptive_field_length + (self.target_field_length - 1)
        else:
            self.target_field_length = config['model']['target_field_length']
            self.input_length = self.receptive_field_length + (self.target_field_length - 1)
        self.target_padding = config['model']['target_padding']
        self.padded_target_field_length = self.target_field_length + 2 * self.target_padding
        self.half_target_field_length = self.target_field_length / 2
        self.half_receptive_field_length = self.receptive_field_length / 2
        self.num_residual_blocks = len(self.dilations) * self.num_stacks
        self.activation = keras.layers.Activation('relu')
        self.samples_of_interest_indices = self.get_padded_target_field_indices()
        self.target_sample_indices = self.get_target_field_indices()
        self.optimizer = self.get_optimizer()
        self.out_1_loss = self.get_out_1_loss()
        self.out_2_loss = self.get_out_2_loss()
        self.out_3_loss = self.get_out_3_loss()
        self.metrics = self.get_metrics()
        self.epoch_num = 0
        self.checkpoints_path = ''
        self.samples_path = ''
        self.history_filename = ''
        self.config['model']['num_residual_blocks'] = self.num_residual_blocks
        self.config['model']['receptive_field_length'] = self.receptive_field_length
        self.config['model']['input_length'] = self.input_length
        self.config['model']['target_field_length'] = self.target_field_length
        self.config['model']['type'] = 'multi-instrument'
        self.model = self.setup_model(load_checkpoint, print_model_summary)

    def setup_model(self, load_checkpoint=None, print_model_summary=False):
        self.checkpoints_path = os.path.join(self.config['training']['path'], 'checkpoints')
        self.samples_path = os.path.join(self.config['training']['path'], 'samples')
        self.history_filename = 'history_' + self.config['training']['path'][
            self.config['training']['path'].rindex('/') + 1:] + '.csv'
        model = self.build_model()
        if os.path.exists(self.checkpoints_path) and util.dir_contains_files(self.checkpoints_path):
            if load_checkpoint is not None:
                last_checkpoint_path = load_checkpoint
                self.epoch_num = 0
            else:
                checkpoints = os.listdir(self.checkpoints_path)
                checkpoints.sort(key=lambda x: os.stat(os.path.join(self.checkpoints_path, x)).st_mtime)
                last_checkpoint = checkpoints[-1]
                last_checkpoint_path = os.path.join(self.checkpoints_path, last_checkpoint)
                self.epoch_num = int(last_checkpoint[11:16])
            print 'Loading model from epoch: %d' % self.epoch_num
            model.load_weights(last_checkpoint_path)
        else:
            print 'Building new model...'
            if not os.path.exists(self.config['training']['path']):
                os.mkdir(self.config['training']['path'])
            if not os.path.exists(self.checkpoints_path):
                os.mkdir(self.checkpoints_path)
            self.epoch_num = 0
        if not os.path.exists(self.samples_path):
            os.mkdir(self.samples_path)
        if print_model_summary:
            model.summary()
        model.compile(optimizer=self.optimizer,
                      loss={'data_output_1': self.out_1_loss,
                            'data_output_2': self.out_2_loss,
                            'data_output_3': self.out_3_loss},
                      metrics=self.metrics)
        self.config['model']['num_params'] = model.count_params()
        config_path = os.path.join(self.config['training']['path'], 'config.json')
        if not os.path.exists(config_path):
            util.pretty_json_dump(self.config, config_path)
        if print_model_summary:
            util.pretty_json_dump(self.config)
        return model

    def get_optimizer(self):
        return keras.optimizers.Adam(lr=self.config['optimizer']['lr'],
                                     decay=self.config['optimizer']['decay'],
                                     epsilon=self.config['optimizer']['epsilon'])

    def get_out_1_loss(self):
        if self.config['training']['loss']['out_1']['weight'] == 0:
            return lambda y_true, y_pred: y_true * 0
        return lambda y_true, y_pred: self.config['training']['loss']['out_1']['weight'] * util.l1_l2_loss(
            y_true, y_pred, self.config['training']['loss']['out_1']['l1'],
            self.config['training']['loss']['out_1']['l2'])

    def get_out_2_loss(self):
        if self.config['training']['loss']['out_2']['weight'] == 0:
            return lambda y_true, y_pred: y_true * 0
        return lambda y_true, y_pred: self.config['training']['loss']['out_2']['weight'] * util.l1_l2_loss(
            y_true, y_pred, self.config['training']['loss']['out_2']['l1'],
            self.config['training']['loss']['out_2']['l2'])

    def get_out_3_loss(self):
        if self.config['training']['loss']['out_3']['weight'] == 0:
            return lambda y_true, y_pred: y_true * 0
        return lambda y_true, y_pred: self.config['training']['loss']['out_3']['weight'] * util.l1_l2_loss(
            y_true, y_pred, self.config['training']['loss']['out_3']['l1'],
            self.config['training']['loss']['out_3']['l2'])

    def get_callbacks(self):
        return [
            keras.callbacks.EarlyStopping(patience=self.config['training']['early_stopping_patience'],
                                          verbose=1, monitor='loss'),
            keras.callbacks.ModelCheckpoint(os.path.join(self.checkpoints_path,
                                                         'checkpoint.{epoch:05d}-{val_loss:.3f}.hdf5')),
            keras.callbacks.CSVLogger(os.path.join(self.config['training']['path'], self.history_filename),
                                      append=True)
        ]

    def fit_model(self, train_set_generator, num_steps_train, test_set_generator, num_steps_test, num_epochs):
        print('Fitting model with %d training num steps and %d test num steps...'
              % (num_steps_train, num_steps_test))
        self.model.fit_generator(train_set_generator, num_steps_train, epochs=num_epochs,
                                 validation_data=test_set_generator, validation_steps=num_steps_test,
                                 callbacks=self.get_callbacks(), verbose=self.verbosity,
                                 initial_epoch=self.epoch_num)

    def separate_batch(self, inputs):
        return self.model.predict_on_batch(inputs)

    def get_target_field_indices(self):
        target_sample_index = self.get_target_sample_index()
        return range(target_sample_index - self.half_target_field_length,
                     target_sample_index + self.half_target_field_length + 1)

    def get_padded_target_field_indices(self):
        target_sample_index = self.get_target_sample_index()
        return range(target_sample_index - self.half_target_field_length - self.target_padding,
                     target_sample_index + self.half_target_field_length + self.target_padding + 1)

    def get_target_sample_index(self):
        return int(np.floor(self.input_length / 2.0))

    def get_metrics(self):
        return [
            keras.metrics.mean_absolute_error,
            self.valid_mean_absolute_error
        ]

    def valid_mean_absolute_error(self, y_true, y_pred):
        return keras.backend.mean(
            keras.backend.abs(y_true[:, 1:-2] - y_pred[:, 1:-2]))

    def build_model(self):
        data_input = keras.engine.Input(
            shape=(self.input_length,), name='data_input')

        data_expanded = layers.AddSingletonDepth()(data_input)

        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['res'],
                                              self.config['model']['filters']['lengths']['res'],
                                              padding='same', use_bias=False,
                                              name='initial_causal_conv')(data_expanded)

        skip_connections = []
        res_block_i = 0
        for stack_i in range(self.num_stacks):
            layer_in_stack = 0
            for dilation in self.dilations:
                res_block_i += 1
                data_out, skip_out = self.dilated_residual_block(data_out, res_block_i, layer_in_stack,
                                                                 dilation, stack_i)
                if skip_out is not None:
                    skip_connections.append(skip_out)
                layer_in_stack += 1

        data_out = keras.layers.Add()(skip_connections)
        data_out = self.activation(data_out)

        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][0],
                                              self.config['model']['filters']['lengths']['final'][0],
                                              padding='same', use_bias=False)(data_out)
        data_out = self.activation(data_out)

        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][1],
                                              self.config['model']['filters']['lengths']['final'][1],
                                              padding='same', use_bias=False)(data_out)

        data_out = keras.layers.Convolution1D(3, 1)(data_out)

        data_out_vocals = layers.Slice((Ellipsis, slice(0, 1)), (self.padded_target_field_length, 1),
                                       name='slice_data_output_1')(data_out)
        data_out_drums = layers.Slice((Ellipsis, slice(1, 2)), (self.padded_target_field_length, 1),
                                      name='slice_data_output_2')(data_out)
        data_out_bass = layers.Slice((Ellipsis, slice(2, 3)), (self.padded_target_field_length, 1),
                                     name='slice_data_output_3')(data_out)

        data_out_vocals = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),
                                              output_shape=lambda shape: (shape[0], shape[1]),
                                              name='data_output_1')(data_out_vocals)
        data_out_drums = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),
                                             output_shape=lambda shape: (shape[0], shape[1]),
                                             name='data_output_2')(data_out_drums)
        data_out_bass = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),
                                            output_shape=lambda shape: (shape[0], shape[1]),
                                            name='data_output_3')(data_out_bass)

        return keras.engine.Model(inputs=[data_input],
                                  outputs=[data_out_vocals, data_out_drums, data_out_bass])

    def dilated_residual_block(self, data_x, res_block_i, layer_i, dilation, stack_i):
        original_x = data_x

        data_out = keras.layers.Conv1D(2 * self.config['model']['filters']['depths']['res'],
                                       self.config['model']['filters']['lengths']['res'],
                                       dilation_rate=dilation, padding='same', use_bias=False,
                                       name='res_%d_dilated_conv_d%d_s%d' % (res_block_i, dilation, stack_i),
                                       activation=None)(data_x)

        data_out_1 = layers.Slice(
            (Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),
            (self.input_length, self.config['model']['filters']['depths']['res']),
            name='res_%d_data_slice_1_d%d_s%d' % (self.num_residual_blocks, dilation, stack_i))(data_out)

        data_out_2 = layers.Slice(
            (Ellipsis, slice(self.config['model']['filters']['depths']['res'],
                             2 * self.config['model']['filters']['depths']['res'])),
            (self.input_length, self.config['model']['filters']['depths']['res']),
            name='res_%d_data_slice_2_d%d_s%d' % (self.num_residual_blocks, dilation, stack_i))(data_out)

        tanh_out = keras.layers.Activation('tanh')(data_out_1)
        sigm_out = keras.layers.Activation('sigmoid')(data_out_2)

        data_x = keras.layers.Multiply(name='res_%d_gated_activation_%d_s%d' % (res_block_i, layer_i, stack_i))(
            [tanh_out, sigm_out])

        data_x = keras.layers.Convolution1D(
            self.config['model']['filters']['depths']['res'] + self.config['model']['filters']['depths']['skip'],
            1, padding='same', use_bias=False)(data_x)

        res_x = layers.Slice((Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),
                             (self.input_length, self.config['model']['filters']['depths']['res']),
                             name='res_%d_data_slice_3_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)

        skip_x = layers.Slice((Ellipsis, slice(self.config['model']['filters']['depths']['res'],
                                               self.config['model']['filters']['depths']['res'] +
                                               self.config['model']['filters']['depths']['skip'])),
                              (self.input_length, self.config['model']['filters']['depths']['skip']),
                              name='res_%d_data_slice_4_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)

        skip_x = layers.Slice((slice(self.samples_of_interest_indices[0],
                                     self.samples_of_interest_indices[-1] + 1, 1), Ellipsis),
                              (self.padded_target_field_length,
                               self.config['model']['filters']['depths']['skip']),
                              name='res_%d_keep_samples_of_interest_d%d_s%d' % (res_block_i, dilation, stack_i))(skip_x)

        res_x = keras.layers.Add()([original_x, res_x])

        return res_x, skip_x
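For orientation before separate.py: a minimal sketch of driving these model classes directly (normally main.py does this). It assumes the conda environment above, a Theano backend, and that the shipped session config points at its own checkpoints folder; the random input is a placeholder just to show shapes:

```python
import json
import numpy as np
import models

# Illustrative only; sessions/singing-voice/config.json ships with this repo.
config = json.load(open('sessions/singing-voice/config.json'))
model = models.SingingVoiceSeparationWavenet(config)  # loads the latest checkpoint, if found

# One batch of input-length random audio, just to inspect output shapes.
batch = np.random.randn(1, model.input_length).astype('float32')
vocals, accompaniment = model.separate_batch({'data_input': batch})
print(vocals.shape)  # (1, padded_target_field_length)
```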
================================================
FILE: separate.py
================================================
# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018
# Separate.py

from __future__ import division
import os
import util
import tqdm
import numpy as np


def separate_sample(model, input, batch_size, output_filename_prefix, sample_rate, output_path, target):

    if target == 'singing-voice':

        if len(input['mixture']) < model.receptive_field_length:
            raise ValueError('Input is not long enough to be used with this model.')

        num_output_samples = input['mixture'].shape[0] - (model.receptive_field_length - 1)
        num_fragments = int(np.ceil(num_output_samples / model.target_field_length))
        num_batches = int(np.ceil(num_fragments / batch_size))

        vocals_output = []
        num_pad_values = 0
        fragment_i = 0
        for batch_i in tqdm.tqdm(range(0, num_batches)):

            if batch_i == num_batches - 1:  # If it's the last batch
                batch_size = num_fragments - batch_i * batch_size

            input_batch = np.zeros((batch_size, model.input_length))

            # Assemble batch
            for batch_fragment_i in range(0, batch_size):

                if fragment_i + model.target_field_length > num_output_samples:
                    remainder = input['mixture'][fragment_i:]
                    current_fragment = np.zeros((model.input_length,))
                    current_fragment[:remainder.shape[0]] = remainder
                    num_pad_values = model.input_length - remainder.shape[0]
                else:
                    current_fragment = input['mixture'][fragment_i:fragment_i + model.input_length]

                input_batch[batch_fragment_i, :] = current_fragment
                fragment_i += model.target_field_length

            separated_output_fragments = model.separate_batch({'data_input': input_batch})

            if type(separated_output_fragments) is list:
                vocals_output_fragment = separated_output_fragments[0]

            vocals_output_fragment = vocals_output_fragment[:,
                                     model.target_padding: model.target_padding + model.target_field_length]
            vocals_output_fragment = vocals_output_fragment.flatten().tolist()

            if type(separated_output_fragments) is float:
                vocals_output_fragment = [vocals_output_fragment]

            vocals_output = vocals_output + vocals_output_fragment

        vocals_output = np.array(vocals_output)

        if num_pad_values != 0:
            vocals_output = vocals_output[:-num_pad_values]

        mixture_valid_signal = input['mixture'][
            model.half_receptive_field_length:model.half_receptive_field_length + len(vocals_output)]

        accompaniment_output = mixture_valid_signal - vocals_output

        output_vocals_filename = output_filename_prefix + '_vocals.wav'
        output_accompaniment_filename = output_filename_prefix + '_accompaniment.wav'

        output_vocals_filepath = os.path.join(output_path, output_vocals_filename)
        output_accompaniment_filepath = os.path.join(output_path, output_accompaniment_filename)

        util.write_wav(vocals_output, output_vocals_filepath, sample_rate)
        util.write_wav(accompaniment_output, output_accompaniment_filepath, sample_rate)
    if target == 'multi-instrument':

        if len(input['mixture']) < model.receptive_field_length:
            raise ValueError('Input is not long enough to be used with this model.')

        num_output_samples = input['mixture'].shape[0] - (model.receptive_field_length - 1)
        num_fragments = int(np.ceil(num_output_samples / model.target_field_length))
        num_batches = int(np.ceil(num_fragments / batch_size))

        vocals_output = []
        drums_output = []
        bass_output = []
        num_pad_values = 0
        fragment_i = 0
        for batch_i in tqdm.tqdm(range(0, num_batches)):

            if batch_i == num_batches - 1:  # If it's the last batch
                batch_size = num_fragments - batch_i * batch_size

            input_batch = np.zeros((batch_size, model.input_length))

            # Assemble batch
            for batch_fragment_i in range(0, batch_size):

                if fragment_i + model.target_field_length > num_output_samples:
                    remainder = input['mixture'][fragment_i:]
                    current_fragment = np.zeros((model.input_length,))
                    current_fragment[:remainder.shape[0]] = remainder
                    num_pad_values = model.input_length - remainder.shape[0]
                else:
                    current_fragment = input['mixture'][fragment_i:fragment_i + model.input_length]

                input_batch[batch_fragment_i, :] = current_fragment
                fragment_i += model.target_field_length

            separated_output_fragments = model.separate_batch({'data_input': input_batch})

            if type(separated_output_fragments) is list:
                vocals_output_fragment = separated_output_fragments[0]
                drums_output_fragment = separated_output_fragments[1]
                bass_output_fragment = separated_output_fragments[2]

            vocals_output_fragment = vocals_output_fragment[:,
                                     model.target_padding: model.target_padding + model.target_field_length]
            vocals_output_fragment = vocals_output_fragment.flatten().tolist()

            drums_output_fragment = drums_output_fragment[:,
                                    model.target_padding: model.target_padding + model.target_field_length]
            drums_output_fragment = drums_output_fragment.flatten().tolist()

            bass_output_fragment = bass_output_fragment[:,
                                   model.target_padding: model.target_padding + model.target_field_length]
            bass_output_fragment = bass_output_fragment.flatten().tolist()

            if type(separated_output_fragments) is float:
                vocals_output_fragment = [vocals_output_fragment]

            if type(drums_output_fragment) is float:
                drums_output_fragment = [drums_output_fragment]

            if type(bass_output_fragment) is float:
                bass_output_fragment = [bass_output_fragment]

            vocals_output = vocals_output + vocals_output_fragment
            drums_output = drums_output + drums_output_fragment
            bass_output = bass_output + bass_output_fragment

        vocals_output = np.array(vocals_output)
        drums_output = np.array(drums_output)
        bass_output = np.array(bass_output)

        if num_pad_values != 0:
            vocals_output = vocals_output[:-num_pad_values]
            drums_output = drums_output[:-num_pad_values]
            bass_output = bass_output[:-num_pad_values]

        mixture_valid_signal = input['mixture'][
            model.half_receptive_field_length:model.half_receptive_field_length + len(vocals_output)]

        other_output = mixture_valid_signal - vocals_output - drums_output - bass_output

        output_vocals_filename = output_filename_prefix + '_vocals.wav'
        output_drums_filename = output_filename_prefix + '_drums.wav'
        output_bass_filename = output_filename_prefix + '_bass.wav'
        output_other_filename = output_filename_prefix + '_other.wav'

        output_vocals_filepath = os.path.join(output_path, output_vocals_filename)
        output_drums_filepath = os.path.join(output_path, output_drums_filename)
        output_bass_filepath = os.path.join(output_path, output_bass_filename)
        output_other_filepath = os.path.join(output_path, output_other_filename)

        util.write_wav(vocals_output, output_vocals_filepath, sample_rate)
        util.write_wav(drums_output, output_drums_filepath, sample_rate)
        util.write_wav(bass_output, output_bass_filepath, sample_rate)
        util.write_wav(other_output, output_other_filepath, sample_rate)

================================================
FILE: sessions/multi-instrument/checkpoints/checkpoint.00045-0.hdf5
================================================
[File too large to display: 38.3 MB]

================================================
FILE: sessions/multi-instrument/config.json
================================================
{
    "dataset": {
        "extract_voice_percentage": 0,
        "in_memory_percentage": 1,
        "path": "MUS",
        "sample_rate": 16000,
        "type": "musdb18"
    },
    "model": {
        "condition_encoding": "binary",
        "dilations": 9,
        "filters": {
            "depths": {
                "final": [2048, 256],
                "res": 64,
                "skip": 64
            },
            "lengths": {
                "final": [3, 3],
                "res": 3,
                "skip": 1
            }
        },
        "input_length": 9785,
        "num_params": 3277763,
        "num_residual_blocks": 40,
        "num_stacks": 4,
        "receptive_field_length": 8185,
        "target_field_length": 1601,
        "target_padding": 1,
        "type": "multi-instrument"
    },
    "optimizer": {
        "decay": 0.0,
        "epsilon": 1e-08,
        "lr": 0.001,
        "momentum": 0.9,
        "type": "adam"
    },
    "training": {
        "batch_size": 10,
        "early_stopping_patience": 16,
        "loss": {
            "out_1": {"l1": 1, "l2": 0, "weight": 1},
            "out_2": {"l1": 1, "l2": 0, "weight": 1},
            "out_3": {"l1": 1, "l2": 0, "weight": 1}
        },
        "num_epochs": 250,
        "num_steps_test": 500,
        "num_steps_train": 2000,
        "path": "sessions/multi-instrument",
        "verbosity": 1
    }
}
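Editor's note: a worked check (not repository code) of the fragment/batch arithmetic in `separate_sample` above, using the constants from this config and a hypothetical 60-second mono mixture at the configured 16 kHz:

```python
import numpy as np

receptive_field_length = 8185   # from the config above
target_field_length = 1601
batch_size = 10

mixture_samples = 60 * 16000                                             # 960000
num_output_samples = mixture_samples - (receptive_field_length - 1)      # 951816
num_fragments = int(np.ceil(num_output_samples / target_field_length))   # 595
num_batches = int(np.ceil(num_fragments / batch_size))                   # 60
```

Each forward pass therefore emits 1601 separated samples per fragment; the zero padding added to the final fragment (`num_pad_values`) is trimmed again after the loop, exactly as in the function above.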
================================================
FILE: sessions/singing-voice/checkpoints/checkpoint.00058-0.hdf5
================================================
[File too large to display: 38.3 MB]

================================================
FILE: sessions/singing-voice/config.json
================================================
{
    "dataset": {
        "extract_voice_percentage": 0.5,
        "in_memory_percentage": 1,
        "path": "data/MUS",
        "sample_rate": 16000,
        "type": "musdb18"
    },
    "model": {
        "condition_encoding": "binary",
        "dilations": 9,
        "filters": {
            "depths": {
                "final": [2048, 256],
                "res": 64,
                "skip": 64
            },
            "lengths": {
                "final": [3, 3],
                "res": 3,
                "skip": 1
            }
        },
        "input_length": 9785,
        "num_params": 3277249,
        "num_residual_blocks": 40,
        "num_stacks": 4,
        "receptive_field_length": 8185,
        "target_field_length": 1601,
        "target_padding": 1,
        "type": "singing-voice"
    },
    "optimizer": {
        "decay": 0.0,
        "epsilon": 1e-08,
        "lr": 0.001,
        "momentum": 0.9,
        "type": "adam"
    },
    "training": {
        "batch_size": 10,
        "early_stopping_patience": 16,
        "loss": {
            "out_1": {"l1": 1, "l2": 0, "weight": 1},
            "out_2": {"l1": 1, "l2": 0, "weight": -0.05}
        },
        "num_epochs": 250,
        "num_steps_test": 500,
        "num_steps_train": 2000,
        "path": "sessions/singing-voice",
        "verbosity": 1
    }
}
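Editor's note: in both session configs the `"loss"` block pairs an `l1`/`l2` mix with a per-output `weight` (here `out_2` carries a negative weight of -0.05, so its term is subtracted from the total). Below is a minimal sketch of how such a block could be wired into a Keras compile step via `util.l1_l2_loss` (defined below); the `out_N` → `data_output_N` name mapping is an assumption based on the output-layer names in `models.py`, not something the repository confirms.

```python
# Hypothetical helper (editorial sketch, not repository code): build per-output
# loss functions and weights from a config's "loss" block via util.l1_l2_loss.
import util

def losses_from_config(config):
    losses, weights = {}, {}
    for out_name, spec in config['training']['loss'].items():
        layer_name = 'data_output_' + out_name.split('_')[-1]  # assumed mapping
        losses[layer_name] = (lambda y_true, y_pred, s=spec:
                              util.l1_l2_loss(y_true, y_pred, s['l1'], s['l2']))
        weights[layer_name] = spec['weight']
    return losses, weights

# e.g. model.compile(optimizer=optimizer, loss=losses, loss_weights=weights)
```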
================================================
FILE: util.py
================================================
# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018
# Util.py
# Utility functions for dealing with audio signals and training a Source Separation Wavenet

import os
import numpy as np
import json
import warnings
import scipy.signal
import scipy.stats
import soundfile as sf
import keras
import glob


def l1_l2_loss(y_true, y_pred, l1_weight, l2_weight):
    loss = 0

    if l1_weight != 0:
        loss += l1_weight * keras.losses.mean_absolute_error(y_true, y_pred)

    if l2_weight != 0:
        loss += l2_weight * keras.losses.mean_squared_error(y_true, y_pred)

    return loss


def compute_receptive_field_length(stacks, dilations, filter_length, target_field_length):
    half_filter_length = (filter_length - 1) / 2
    length = 0
    for d in dilations:
        length += d * half_filter_length
    length = 2 * length
    length = stacks * length
    length += target_field_length
    return length


def wav_to_float(x):
    try:
        max_value = np.iinfo(x.dtype).max
        min_value = np.iinfo(x.dtype).min
    except ValueError:  # dtype is not an integer type
        max_value = np.finfo(x.dtype).max
        min_value = np.finfo(x.dtype).min
    x = x.astype('float64', casting='safe')
    x -= min_value
    x /= ((max_value - min_value) / 2.)
    x -= 1.
    return x


def float_to_uint8(x):
    x += 1.
    x /= 2.
    uint8_max_value = np.iinfo('uint8').max
    x *= uint8_max_value
    x = x.astype('uint8')
    return x


def keras_float_to_uint8(x):
    x += 1.
    x /= 2.
    uint8_max_value = 255
    x *= uint8_max_value
    return x


def linear_to_ulaw(x, u=255):
    x = np.sign(x) * (np.log(1 + u * np.abs(x)) / np.log(1 + u))
    return x


def keras_linear_to_ulaw(x, u=255.0):
    x = keras.backend.sign(x) * (keras.backend.log(1 + u * keras.backend.abs(x)) / keras.backend.log(1 + u))
    return x


def uint8_to_float(x):
    max_value = np.iinfo('uint8').max
    min_value = np.iinfo('uint8').min
    x = x.astype('float32', casting='unsafe')
    x -= min_value
    x /= ((max_value - min_value) / 2.)
    x -= 1.
    return x


def keras_uint8_to_float(x):
    max_value = 255
    min_value = 0
    x -= min_value
    x /= ((max_value - min_value) / 2.)
    x -= 1.
    return x


def ulaw_to_linear(x, u=255.0):
    y = np.sign(x) * (1 / float(u)) * (((1 + float(u)) ** np.abs(x)) - 1)
    return y


def keras_ulaw_to_linear(x, u=255.0):
    y = keras.backend.sign(x) * (1 / u) * (((1 + u) ** keras.backend.abs(x)) - 1)
    return y


def one_hot_encode(x, num_values=256):
    if isinstance(x, int):
        x = np.array([x])
    if isinstance(x, list):
        x = np.array(x)
    return np.eye(num_values, dtype='uint8')[x.astype('uint8')]


def one_hot_decode(x):
    return np.argmax(x, axis=-1)


def preemphasis(signal, alpha=0.95):
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])


def binary_encode(x, max_value):
    if isinstance(x, int):
        x = np.array([x])
    if isinstance(x, list):
        x = np.array(x)
    width = np.ceil(np.log2(max_value)).astype(int)
    return ((x[:, None] & (1 << np.arange(width))) > 0).astype(int)


def get_condition_input_encode_func(representation):
    if representation == 'binary':
        return binary_encode
    else:
        return one_hot_encode


def ensure_keys_in_dict(keys, dictionary):
    return all(key in dictionary for key in keys)


def get_subdict_from_dict(keys, dictionary):
    return dict((k, dictionary[k]) for k in keys if k in dictionary)


def pretty_json_dump(values, file_path=None):
    if file_path is None:
        print(json.dumps(values, sort_keys=True, indent=4, separators=(',', ': ')))
    else:
        json.dump(values, open(file_path, 'w'), sort_keys=True, indent=4, separators=(',', ': '))


def read_wav(filename):
    # Reads in a wav audio file, averages both if stereo, converts the signal to float64 representation
    audio_signal, sample_rate = sf.read(filename)

    if audio_signal.ndim > 1:
        audio_signal = (audio_signal[:, 0] + audio_signal[:, 1]) / 2.0

    if audio_signal.dtype != 'float64':
        audio_signal = wav_to_float(audio_signal)

    return audio_signal, sample_rate


def load_wav(wav_path, desired_sample_rate):
    sequence, sample_rate = read_wav(wav_path)
    sequence = ensure_sample_rate(sequence, desired_sample_rate, sample_rate)
    return sequence


def write_wav(x, filename, sample_rate):
    if type(x) != np.ndarray:
        x = np.array(x)

    with warnings.catch_warnings():
        warnings.simplefilter("error")
        sf.write(filename, x, sample_rate)


def ensure_sample_rate(x, desired_sample_rate, file_sample_rate):
    if file_sample_rate != desired_sample_rate:
        return scipy.signal.resample_poly(x, desired_sample_rate, file_sample_rate)
    return x


def normalize(x):
    max_peak = np.max(np.abs(x))
    return x / max_peak


def get_sequence_with_singing_indices(full_sequence):
    signal_magnitude = np.abs(full_sequence)

    chunk_length = 800

    chunks_energies = []
    for i in range(0, len(signal_magnitude), chunk_length):
        chunks_energies.append(np.mean(signal_magnitude[i:i + chunk_length]))

    threshold = np.max(chunks_energies) * .1

    chunks_energies = np.asarray(chunks_energies)
    chunks_energies[np.where(chunks_energies < threshold)] = 0
    onsets = np.zeros(len(chunks_energies))
    onsets[np.nonzero(chunks_energies)] = 1

    onsets = np.diff(onsets)

    start_ind = np.squeeze(np.where(onsets == 1))
    finish_ind = np.squeeze(np.where(onsets == -1))

    if finish_ind[0] < start_ind[0]:
        finish_ind = finish_ind[1:]

    if start_ind[-1] > finish_ind[-1]:
        start_ind = start_ind[:-1]

    indices_inici_final = np.insert(finish_ind, np.arange(len(start_ind)), start_ind)

    return np.squeeze((np.asarray(indices_inici_final) + 1) * chunk_length)


def get_indices_subsequence(indices):
    start_indice = 2 * np.random.randint(0, np.ceil(len(indices) / 2))
    vocals_indices = (indices[start_indice], indices[start_indice + 1])
    accompaniment_indices = vocals_indices

    return vocals_indices, accompaniment_indices
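# Editorial note (not part of the original file): get_sequence_with_singing_indices
# above computes the mean absolute energy of 800-sample chunks, zeroes chunks below
# 10% of the loudest chunk, and returns the alternating start/end sample indices
# of the surviving (singing) regions; contains_voice below applies the same
# chunking and 10% threshold to decide whether a given fragment overlaps singing.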
def contains_voice(fragment, sequence):
    signal_fragment_magnitude = np.abs(fragment)
    signal_sequence_magnitude = np.abs(sequence)

    chunk_length = 800

    chunks_fragment_energies = []
    for i in range(0, len(signal_fragment_magnitude), chunk_length):
        chunks_fragment_energies.append(np.mean(signal_fragment_magnitude[i:i + chunk_length]))

    chunks_sequence_energies = []
    for i in range(0, len(signal_sequence_magnitude), chunk_length):
        chunks_sequence_energies.append(np.mean(signal_sequence_magnitude[i:i + chunk_length]))

    threshold = np.max(chunks_sequence_energies) * .1

    chunks_fragment_energies = np.asarray(chunks_fragment_energies)
    chunks_fragment_energies[np.where(chunks_fragment_energies < threshold)] = 0

    if np.count_nonzero(chunks_fragment_energies) > 0:
        return True
    else:
        return False


def dir_contains_files(path):
    for f in os.listdir(path):
        if not f.startswith('.'):
            return True
    return False
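Editor's note: a short usage sketch (not repository code) exercising two of the helpers above; it assumes `util.py` is importable from the repository environment (it imports keras at module level), and the expected values are worked out by hand.

```python
import numpy as np
import util

# mu-law companding and its inverse are exact inverses on [-1, 1]
x = np.array([-0.5, 0.0, 0.25, 1.0])
y = util.linear_to_ulaw(x)
assert np.allclose(util.ulaw_to_linear(y), x)

# binary_encode emits least-significant-bit-first rows,
# width = ceil(log2(max_value)); 5 = 1 + 4 -> [1, 0, 1]
print(util.binary_encode(5, 8))   # [[1 0 1]]
```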