[
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2018 Francesc Lluís Salvadó\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "A Wavenet for Music Source Separation\n====\n\nA neural network for end-to-end music source separation, as described in [End-to-end music source separation:\nis it possible in the waveform domain?](https://arxiv.org/abs/1810.12187)\n\nListen to separated samples [here](http://jordipons.me/apps/end-to-end-music-source-separation/)\n\nWhat is a Wavenet for Music Source Separation?\n-----\n\nThe Wavenet for Music Source Separation is a fully convolutional neural network that directly operates on the raw audio waveform.\n\nIt is an adaptation of [Wavenet](https://deepmind.com/blog/wavenet-generative-model-raw-audio/) that turns the original causal model (that is generative and slow), into a non-causal model (that is discriminative and parallelizable). This idea was originally proposed by [Rethage et al.](https://arxiv.org/abs/1706.07162) for speech denoising and now it is adapted for monaural music source separation. Their [code](https://github.com/drethage/speech-denoising-wavenet) is reused.\n\nThe main difference between the original Wavenet and the non-causal adaptation used, is that some samples from the future can be used to predict the present one. As a result of removing the autoregressive causal nature of the original Wavenet, this fully convolutional model is now able to predict a target field instead of one sample at a time – due to this parallelization, it is possible to run the model in real-time on a GPU.\n\n<img src=\"img/wavenet_target_field.jpg\">\n\nSee the diagram below for a summary of the network architecture.\n\n<img src=\"img/wavenet_diagram.jpg\">\n\nInstallation\n-----\n1. `git clone https://github.com/francesclluis/source-separation-wavenet.git`\n2. Install [conda](https://conda.io/docs/user-guide/install/index.html)\n3. `conda env create -f environment.yml`\n4. `source activate sswavenet`\n\n*Currently the project requires **Keras 2.1** and **Theano 1.0.1**, the large dilations present in the architecture are not supported by the current version of Tensorflow*\n\nUsage\n-----\n\nA pre-trained multi-instrument model (best-performing model described in the paper) can be found in `sessions/multi-instrument/checkpoints` and is ready to be used out-of-the-box. The parameterization of this model is specified in `sessions/multi-instrument/config.json`\n\nA pre-trained singing-voice model (best-performing model described in the paper) can be found in `sessions/singing-voice/checkpoints` and is ready to be used out-of-the-box. The parameterization of this model is specified in `sessions/singing-voice/config.json`\n\n*Download the dataset as described [below](https://github.com/francesclluis/source-separation-wavenet#dataset)*\n\n#### Source Separation:\n\nExample (multi-instrument): `THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/multi-instrument/config.json --mixture_input_path audio/`\n\nExample (singing-voice): `THEANO_FLAGS=device=cuda python main.py --mode inference --config sessions/singing-voice/config.json --mixture_input_path audio/`\n\n###### Speedup\nTo achieve faster source separation, one can increase the target-field length by use of the optional `--target_field_length` argument. This defines the amount of samples that are separated in a single forward propagation, saving redundant calculations. 
\n\n#### Training:\n\nExample (multi-instrument): `THEANO_FLAGS=device=cuda python main.py --mode training --target multi-instrument --config config_multi_instrument.json`\n\nExample (singing-voice): `THEANO_FLAGS=device=cuda python main.py --mode training --target singing-voice --config config_singing_voice.json`\n\n#### Configuration\nA detailed description of all configurable parameters can be found in [config.md](https://github.com/francesclluis/source-separation-wavenet/blob/master/config.md).\n\n#### Optional command-line arguments:\nArgument | Valid Inputs | Default | Description\n-------- | ------------ | ------- | -----------\nmode | [training, inference] | training | Mode of operation\ntarget | [multi-instrument, singing-voice] | multi-instrument | Target of the model to train\nconfig | string | config.json | Path to JSON-formatted config file\nprint_model_summary | bool | False | Prints verbose summary of the model\nload_checkpoint | string | None | Path to hdf5 file containing a snapshot of model weights\n\n#### Additional arguments during source separation:\nArgument | Valid Inputs | Default | Description\n-------- | ------------ | ------- | -----------\none_shot | bool | False | Separates each audio file in a single forward propagation\ntarget_field_length | int | as defined in config.json | Overrides the config.json value to separate with a different target-field length than used in training\nbatch_size | int | as defined in config.json | Number of samples per batch\n\nDataset\n-----\n\nThe MUSDB18 dataset is used for training the model. It is provided by the community-based Signal Separation Evaluation Campaign (SiSEC).\n\n1. [Download here](https://sigsep.github.io/datasets/musdb.html#download)\n2. Decode the dataset to WAV format as explained [here](https://github.com/sigsep/sigsep-mus-io)\n3. Extract to `data/MUSDB`
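\n\nTo sanity-check the extracted dataset, the snippet below (a minimal sketch, not part of the original codebase) lists the training tracks with the same `musdb` calls that `datasets.py` uses:\n\n```python\nimport musdb\n\n# Open the decoded (WAV) dataset and enumerate the 'train' subset,\n# exactly as datasets.py does when loading songs.\nmus = musdb.DB(root_dir='data/MUSDB', is_wav=True)\ntracks = mus.load_mus_tracks(subsets='train')\nprint('%d training tracks found' % len(tracks))\n```\n"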
  },
  {
    "path": "config.md",
    "content": "config.json - Configuring a training session\n----\n\nThe parameters present in a `config.json` file allow one to configure a training session. Each of these parameters is described below:\n\n### Dataset\nHow the data is used for training\n* **extract_voice_percentage**: (float) Proportion  of  the  data containing  singing  voice  (instead  of  vocal  streams having silence)\n* **in_memory_percentage**: (float) Percentage of the dataset to load into memory, useful when dataset requires more memory than available\n* **path**: (string) Path to dataset\n* **sample_rate**: (int) Sample rate to which all samples should be resampled to\n* **type**: (string) Identifier of which dataset is being used for training\n\n### Model\nWhat the model will be\n* **condition_encoding**: (string) Which numerical representation to encode integer condition values to, either binary or one-hot\n* **dilations**: (int) Maximum dilation factor as an exponent of 2, e.g. dilations = 9 results in a maximum dilation of 2^9 = 512\n* **filters**:\n  * **lengths**:\n    * **res**: (int) Lengths of convolution kernels in residual blocks\n    * **final**: ([int, int]) Lengths of convolution kernels in final layers, individually definable\n    * **skip**: (int) Lengths of convolution kernels in skip connections\n  * **depths**:\n    * **res**: (int) Number of filters in residual-block convolution layers\n    * **skip**: (int) Number of filters in skip connections\n    * **final**: ([int, int]) Number of filters in final layers, individually definable\n* **num_stacks**: (int) Number of stacks, as defined in the paper\n* **target_field_length**: (int) Length of the output\n* **target_padding**: (int) Number of samples used for padding the target_field *per side*\n\n### Training\nHow training will be carried out\n\n* **batch_size**: (int) Number of samples per batch\n* **early_stopping_patience**: (int) Number of epochs to wait without improvement in accuracy before stopping training\n* **loss**: (in the case of multi-instrument)\n  * **out_1**: First term in the three term loss (vocals)\n    * **l1**: (float) Percentage weight given to L1 loss\n    * **l2**: (float) Percentage weight given to L2 loss\n    * **weight**: (float) Percentage weight given to first term\n  * **out_2**: Second term in the three term loss (drums)\n    * **l1**: (float) Percentage weight given to L1 loss\n    * **l2**: (float) Percentage weight given to L2 loss\n    * **weight**: (float) Percentage weight given to second term\n  * **out_3**: Third term in the three term loss  (bass)\n    * **l1**: (float) Percentage weight given to L1 loss\n    * **l2**: (float) Percentage weight given to L2 loss\n    * **weight**: (float) Percentage weight given to third term\n* **loss**: (in the case of singing-voice)\n  * **out_1**: First term in the two term loss (singing voice)\n    * **l1**: (float) Percentage weight given to L1 loss\n    * **l2**: (float) Percentage weight given to L2 loss\n    * **weight**: (float) Percentage weight given to first term\n  * **out_2**: Second term in the two term loss (dissimilarity singing voice)\n    * **l1**: (float) Percentage weight given to L1 loss\n    * **l2**: (float) Percentage weight given to L2 loss\n    * **weight**: (float) Percentage weight given to second term\n* **num_epochs**: (int) Maximum number of epochs to train for\n* **num_steps_test**: (int) Total number of steps (batches of samples) to yield from validation generator before stopping at the end of every epoch.\n* **num_steps_train**: 
(int) Total number of steps (batches of samples) to yield from training generator before declaring one epoch finished and starting the next epoch.\n* **path**: (string) Path to the folder containing all files pertaining to the training session\n* **verbosity**: (int) Keras verbosity level\n"
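\nAs a reference for how the loss weights interact, the sketch below mirrors the per-output losses built in `models.py`, where each output contributes `weight * util.l1_l2_loss(y_true, y_pred, l1, l2)` and a term with weight 0 is dropped. The body of `l1_l2_loss` is an assumption here: it is taken to be a weighted blend of mean absolute error and mean squared error.\n\n```python\nimport numpy as np\n\ndef l1_l2_loss(y_true, y_pred, l1, l2):\n    # Assumption: util.l1_l2_loss blends mean absolute error (L1) and\n    # mean squared error (L2) with the given weights.\n    diff = y_true - y_pred\n    return l1 * np.mean(np.abs(diff)) + l2 * np.mean(diff ** 2)\n\ndef weighted_output_loss(y_true, y_pred, term):\n    # Mirrors models.py: a term whose weight is 0 contributes nothing.\n    if term['weight'] == 0:\n        return 0.0\n    return term['weight'] * l1_l2_loss(y_true, y_pred, term['l1'], term['l2'])\n\n# Example with the multi-instrument defaults (pure L1, unit weight):\nterm = {'l1': 1, 'l2': 0, 'weight': 1}\ny_true = np.array([0.0, 0.5, -0.5])\ny_pred = np.array([0.1, 0.4, -0.5])\nprint(weighted_output_loss(y_true, y_pred, term))  # ~0.0667\n```\n"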
  },
  {
    "path": "config_multi_instrument.json",
    "content": "{\n    \"dataset\": {\n        \"in_memory_percentage\": 1,\n\t\"extract_voice_percentage\": 0,\n        \"path\": \"data/MUSDB\",\n        \"sample_rate\": 16000,\n        \"type\": \"musdb18\"\n    },\n    \"model\": {\n        \"condition_encoding\": \"binary\",\n        \"dilations\": 9,\n        \"filters\": {\n            \"lengths\": {\n                \"res\": 3,\n                \"final\": [3, 3],\n                \"skip\": 1\n            },\n            \"depths\": {\n                \"res\": 64,\n                \"skip\": 64,\n                \"final\": [2048, 256]\n            }\n        },\n        \"num_stacks\": 4,\n        \"target_field_length\": 1601,\n        \"target_padding\": 1\n    },\n    \"optimizer\": {\n        \"decay\": 0.0,\n        \"epsilon\": 1e-08,\n        \"lr\": 0.001,\n        \"momentum\": 0.9,\n        \"type\": \"adam\"\n    },\n    \"training\": {\n        \"batch_size\": 10,\n        \"early_stopping_patience\": 16,\n        \"loss\": {\n            \"out_1\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            },\n            \"out_2\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            },\n            \"out_3\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            }\n        },\n        \"num_epochs\": 250,\n        \"num_steps_test\": 500,\n        \"num_steps_train\": 2000,\n        \"path\": \"sessions/003\",\n        \"verbosity\": 1\n    }\n}\n"
  },
  {
    "path": "config_singing_voice.json",
    "content": "{\n    \"dataset\": {\n        \"in_memory_percentage\": 1,\n\t\"extract_voice_percentage\": 0.5,\n        \"path\": \"data/MUSDB\",\n        \"sample_rate\": 16000,\n        \"type\": \"musdb18\"\n    },\n    \"model\": {\n        \"condition_encoding\": \"binary\",\n        \"dilations\": 9,\n        \"filters\": {\n            \"lengths\": {\n                \"res\": 3,\n                \"final\": [3, 3],\n                \"skip\": 1\n            },\n            \"depths\": {\n                \"res\": 64,\n                \"skip\": 64,\n                \"final\": [2048, 256]\n            }\n        },\n        \"num_stacks\": 4,\n        \"target_field_length\": 1601,\n        \"target_padding\": 1\n    },\n    \"optimizer\": {\n        \"decay\": 0.0,\n        \"epsilon\": 1e-08,\n        \"lr\": 0.001,\n        \"momentum\": 0.9,\n        \"type\": \"adam\"\n    },\n    \"training\": {\n        \"batch_size\": 10,\n        \"early_stopping_patience\": 16,\n        \"loss\": {\n            \"out_1\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            },\n            \"out_2\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": -0.05\n            }\n        },\n        \"num_epochs\": 250,\n        \"num_steps_test\": 500,\n        \"num_steps_train\": 2000,\n        \"path\": \"sessions/002\",\n        \"verbosity\": 1\n    }\n}\n"
  },
  {
    "path": "datasets.py",
    "content": "# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018\n# Datasets.py\n\nimport util\nimport os\nimport numpy as np\nimport musdb\nimport logging\n\n\nclass SingingVoiceMUSDB18Dataset():\n\n    def __init__(self, config, model):\n        self.model = model\n        self.path = config['dataset']['path']\n        self.sample_rate = config['dataset']['sample_rate']\n        self.file_paths = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}, 'val':\n            {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}\n        self.sequences = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}, 'val':\n            {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}\n        self.voice_indices = {'train': [], 'val': []}\n        self.batch_size = config['training']['batch_size']\n        self.extract_voice_percent = config['dataset']['extract_voice_percentage']\n        self.in_memory_percentage = config['dataset']['in_memory_percentage']\n        self.num_sequences_in_memory = 0\n        self.condition_encode_function = util.get_condition_input_encode_func(config['model']['condition_encoding'])\n\n    def load_dataset(self):\n\n        print('Loading MUSDB18 dataset for singing voice separation...')\n\n        mus = musdb.DB(root_dir=self.path, is_wav=True)\n        tracks = mus.load_mus_tracks(subsets='train')\n        np.random.seed(seed=1337)\n        val_idx = np.random.choice(len(tracks), size=25, replace=False)\n        train_idx = [i for i in range(len(tracks)) if i not in val_idx]\n        val_tracks = [tracks[i] for i in val_idx]\n        train_tracks = [tracks[i] for i in train_idx]\n        for condition in ['mixture', 'vocals']:\n            self.file_paths['val'][condition] = [track.path[:-11] + condition + '.wav' for track in val_tracks]\n        for condition in ['mixture', 'vocals']:\n            self.file_paths['train'][condition] = [track.path[:-11] + condition + '.wav' for track in train_tracks]\n        self.load_songs()\n        return self\n\n    def load_songs(self):\n\n        for set in ['train', 'val']:\n            for condition in ['mixture', 'vocals']:\n                for filepath in self.file_paths[set][condition]:\n\n                    if condition == 'vocals':\n\n                        sequence = util.load_wav(filepath, self.sample_rate)\n                        self.sequences[set][condition].append(sequence)\n                        self.num_sequences_in_memory += 1\n\n                        if self.extract_voice_percent > 0:\n                            self.voice_indices[set].append(util.get_sequence_with_singing_indices(sequence))\n                    else:\n\n                        if self.in_memory_percentage == 1 or np.random.uniform(0, 1) <= (\n                                    self.in_memory_percentage - 0.5) * 2:\n                            sequence = util.load_wav(filepath, self.sample_rate)\n                            self.sequences[set][condition].append(sequence)\n                            self.num_sequences_in_memory += 1\n                        else:\n                            self.sequences[set][condition].append([-1])\n\n    def get_num_sequences_in_dataset(self):\n        return len(self.sequences['train']['vocals']) + len(self.sequences['train']['mixture']) + len(\n            self.sequences['val']['vocals']) + len(self.sequences['val']['mixture'])\n\n    def retrieve_sequence(self, set, condition, sequence_num):\n\n      
  if len(self.sequences[set][condition][sequence_num]) == 1:\n            sequence = util.load_wav(self.file_paths[set][condition][sequence_num], self.sample_rate)\n\n            if (float(self.num_sequences_in_memory) / self.get_num_sequences_in_dataset()) < self.in_memory_percentage:\n                self.sequences[set][condition][sequence_num] = sequence\n                self.num_sequences_in_memory += 1\n        else:\n            sequence = self.sequences[set][condition][sequence_num]\n\n        return np.array(sequence)\n\n    def get_random_batch_generator(self, set):\n\n        if set not in ['train', 'val']:\n            raise ValueError(\"Argument SET must be either 'train' or 'val'\")\n\n        while True:\n            sample_indices = np.random.randint(0, len(self.sequences[set]['vocals']), self.batch_size)\n            batch_inputs = []\n            batch_outputs_1 = []\n            batch_outputs_2 = []\n\n            for i, sample_i in enumerate(sample_indices):\n\n                while True:\n\n                    starting_index = 0\n\n                    mixture = self.retrieve_sequence(set, 'mixture', sample_i)\n                    vocals = self.retrieve_sequence(set, 'vocals', sample_i)\n                    accompaniment = mixture - vocals\n\n                    if np.random.uniform(0, 1) < self.extract_voice_percent:\n                        indices = self.voice_indices[set][sample_i]\n                        vocals_indices, _ = util.get_indices_subsequence(indices)\n                        vocals = vocals[vocals_indices[0]:vocals_indices[1]]\n                        starting_index = vocals_indices[0]\n\n                    if len(vocals) < self.model.input_length:\n                        sample_i = np.random.randint(0, len(self.sequences[set]['vocals']))\n                    else:\n                        break\n\n                offset_1 = np.squeeze(np.random.randint(0, len(vocals) - self.model.input_length + 1, 1))\n                vocals_fragment = vocals[offset_1:offset_1 + self.model.input_length]\n                offset_2 = offset_1 + starting_index\n                accompaniment_fragment = accompaniment[offset_2:offset_2 + self.model.input_length]\n\n                input = accompaniment_fragment + vocals_fragment\n                output_vocals = vocals_fragment\n                output_accompaniment = accompaniment_fragment\n\n                batch_inputs.append(input)\n                batch_outputs_1.append(output_vocals)\n                batch_outputs_2.append(output_accompaniment)\n\n            batch_inputs = np.array(batch_inputs, dtype='float32')\n            batch_outputs_1 = np.array(batch_outputs_1, dtype='float32')\n            batch_outputs_2 = np.array(batch_outputs_2, dtype='float32')\n            batch_outputs_1 = batch_outputs_1[:, self.model.get_padded_target_field_indices()]\n            batch_outputs_2 = batch_outputs_2[:, self.model.get_padded_target_field_indices()]\n\n            batch = {'data_input': batch_inputs}, {'data_output_1': batch_outputs_1,\n                                                   'data_output_2': batch_outputs_2}\n\n            yield batch\n\n    def get_condition_input_encode_func(self, representation):\n\n        if representation == 'binary':\n            return util.binary_encode\n        else:\n            return util.one_hot_encode\n\n    def get_target_sample_index(self):\n        return int(np.floor(self.fragment_length / 2.0))\n\n    def get_samples_of_interest_indices(self, causal=False):\n\n        if 
causal:\n            return -1\n        else:\n            target_sample_index = self.get_target_sample_index()\n            return range(target_sample_index - self.half_target_field_length - self.target_padding,\n                         target_sample_index + self.half_target_field_length + self.target_padding + 1)\n\n    def get_sample_weight_vector_length(self):\n        if self.samples_of_interest_only:\n            return len(self.get_samples_of_interest_indices())\n        else:\n            return self.fragment_length\n\n\nclass MultiInstrumentMUSDB18Dataset():\n\n    def __init__(self, config, model):\n        self.model = model\n        self.path = config['dataset']['path']\n        self.sample_rate = config['dataset']['sample_rate']\n        self.file_paths = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}, 'val':\n            {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}\n        self.sequences = {'train': {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}, 'val':\n            {'vocals': [], 'mixture': [], 'drums': [], 'other': [], 'bass': []}}\n        self.voice_indices = {'train': [], 'val': []}\n        self.batch_size = config['training']['batch_size']\n        self.extract_voice_percent = config['dataset']['extract_voice_percentage']\n        self.in_memory_percentage = config['dataset']['in_memory_percentage']\n        self.num_sequences_in_memory = 0\n        self.condition_encode_function = util.get_condition_input_encode_func(config['model']['condition_encoding'])\n\n    def load_dataset(self):\n\n        print('Loading MUSDB18 dataset for multi-instrument separation...')\n\n        mus = musdb.DB(root_dir=self.path, is_wav=True)\n        tracks = mus.load_mus_tracks(subsets='train')\n        np.random.seed(seed=1337)\n        val_idx = np.random.choice(len(tracks), size=25, replace=False)\n        train_idx = [i for i in range(len(tracks)) if i not in val_idx]\n        val_tracks = [tracks[i] for i in val_idx]\n        train_tracks = [tracks[i] for i in train_idx]\n        for condition in ['mixture', 'vocals', 'drums', 'other', 'bass']:\n            self.file_paths['val'][condition] = [track.path[:-11] + condition + '.wav' for track in val_tracks]\n        for condition in ['mixture', 'vocals', 'drums', 'other', 'bass']:\n            self.file_paths['train'][condition] = [track.path[:-11] + condition + '.wav' for track in train_tracks]\n        self.load_songs()\n        return self\n\n    def load_songs(self):\n\n        for set in ['train', 'val']:\n            for condition in ['vocals', 'mixture', 'drums', 'other', 'bass']:\n                for filepath in self.file_paths[set][condition]:\n\n                    if condition == 'vocals':\n\n                        sequence = util.load_wav(filepath, self.sample_rate)\n                        self.sequences[set][condition].append(sequence)\n                        self.num_sequences_in_memory += 1\n\n                        if self.extract_voice_percent > 0:\n                            self.voice_indices[set].append(util.get_sequence_with_singing_indices(sequence))\n                    else:\n\n                        if self.in_memory_percentage == 1 or np.random.uniform(0, 1) <= (\n                                    self.in_memory_percentage - 0.5) * 2:\n                            sequence = util.load_wav(filepath, self.sample_rate)\n                            self.sequences[set][condition].append(sequence)\n                            
self.num_sequences_in_memory += 1\n                        else:\n                            self.sequences[set][condition].append([-1])\n\n    def get_num_sequences_in_dataset(self):\n        return len(self.sequences['train']['vocals']) + len(self.sequences['train']['mixture']) + len(\n            self.sequences['val']['vocals']) + len(self.sequences['val']['mixture'])\n\n    def retrieve_sequence(self, set, condition, sequence_num):\n\n        if len(self.sequences[set][condition][sequence_num]) == 1:\n            sequence = util.load_wav(self.file_paths[set][condition][sequence_num], self.sample_rate)\n\n            if (float(self.num_sequences_in_memory) / self.get_num_sequences_in_dataset()) < self.in_memory_percentage:\n                self.sequences[set][condition][sequence_num] = sequence\n                self.num_sequences_in_memory += 1\n        else:\n            sequence = self.sequences[set][condition][sequence_num]\n\n        return np.array(sequence)\n\n    def get_random_batch_generator(self, set):\n\n        if set not in ['train', 'val']:\n            raise ValueError(\"Argument SET must be either 'train' or 'val'\")\n\n        while True:\n            sample_indices = np.random.randint(0, len(self.sequences[set]['vocals']), self.batch_size)\n            batch_inputs = []\n            batch_outputs_1 = []\n            batch_outputs_2 = []\n            batch_outputs_3 = []\n\n            for i, sample_i in enumerate(sample_indices):\n\n                while True:\n\n                    starting_index = 0\n\n                    vocals = self.retrieve_sequence(set, 'vocals', sample_i)\n                    bass = self.retrieve_sequence(set, 'bass', sample_i)\n                    drums = self.retrieve_sequence(set, 'drums', sample_i)\n                    other = self.retrieve_sequence(set, 'other', sample_i)\n\n                    if np.random.uniform(0, 1) < self.extract_voice_percent:\n                        indices = self.voice_indices[set][sample_i]\n                        vocals_indices, _ = util.get_indices_subsequence(indices)\n                        vocals = vocals[vocals_indices[0]:vocals_indices[1]]\n                        starting_index = vocals_indices[0]\n\n                    if len(vocals) < self.model.input_length:\n                        sample_i = np.random.randint(0, len(self.sequences[set]['vocals']))\n                    else:\n                        break\n\n                offset_1 = np.squeeze(np.random.randint(0, len(vocals) - self.model.input_length + 1, 1))\n                vocals_fragment = vocals[offset_1:offset_1 + self.model.input_length]\n                offset_2 = offset_1 + starting_index\n                bass_fragment = bass[offset_2:offset_2 + self.model.input_length]\n                drums_fragment = drums[offset_2:offset_2 + self.model.input_length]\n                other_fragment = other[offset_2:offset_2 + self.model.input_length]\n\n                input = vocals_fragment + bass_fragment + drums_fragment + other_fragment\n                output_vocals = vocals_fragment\n                output_drums = drums_fragment\n                output_bass = bass_fragment\n\n                batch_inputs.append(input)\n                batch_outputs_1.append(output_vocals)\n                batch_outputs_2.append(output_drums)\n                batch_outputs_3.append(output_bass)\n\n            batch_inputs = np.array(batch_inputs, dtype='float32')\n            batch_outputs_1 = np.array(batch_outputs_1, dtype='float32')\n            
batch_outputs_2 = np.array(batch_outputs_2, dtype='float32')\n            batch_outputs_3 = np.array(batch_outputs_3, dtype='float32')\n\n            batch_outputs_1 = batch_outputs_1[:, self.model.get_padded_target_field_indices()]\n            batch_outputs_2 = batch_outputs_2[:, self.model.get_padded_target_field_indices()]\n            batch_outputs_3 = batch_outputs_3[:, self.model.get_padded_target_field_indices()]\n\n            batch = {'data_input': batch_inputs}, {'data_output_1': batch_outputs_1,\n                                                   'data_output_2': batch_outputs_2,\n                                                   'data_output_3': batch_outputs_3}\n\n            yield batch\n\n    def get_condition_input_encode_func(self, representation):\n\n        if representation == 'binary':\n            return util.binary_encode\n        else:\n            return util.one_hot_encode\n\n    def get_target_sample_index(self):\n        return int(np.floor(self.fragment_length / 2.0))\n\n    def get_samples_of_interest_indices(self, causal=False):\n\n        if causal:\n            return -1\n        else:\n            target_sample_index = self.get_target_sample_index()\n            return range(target_sample_index - self.half_target_field_length - self.target_padding,\n                         target_sample_index + self.half_target_field_length + self.target_padding + 1)\n\n    def get_sample_weight_vector_length(self):\n        if self.samples_of_interest_only:\n            return len(self.get_samples_of_interest_indices())\n        else:\n            return self.fragment_length\n"
  },
  {
    "path": "environment.yml",
    "content": "name: sswavenet\nchannels:\n  - anaconda\n  - conda-forge\n  - defaults\ndependencies:\n  - intel-openmp=2018.0.0=hc7b2577_8\n  - mkl=2018.0.1=h19d6760_4\n  - mkl-service=1.1.2=py27hb2d42c5_4\n  - ca-certificates=2018.1.18=0\n  - certifi=2018.1.18=py27_0\n  - h5py=2.7.1=py27_2\n  - hdf5=1.10.1=2\n  - keras=2.1.5=py27_0\n  - libgpuarray=0.7.5=0\n  - mako=1.0.7=py27_0\n  - markupsafe=1.0=py27_0\n  - openssl=1.0.2n=0\n  - pygpu=0.7.5=py27_0\n  - pyyaml=3.12=py27_1\n  - six=1.11.0=py27_1\n  - theano=1.0.1=py27_1\n  - yaml=0.1.7=0\n  - libedit=3.1=heed3624_0\n  - libffi=3.2.1=hd88cf55_4\n  - libgcc-ng=7.2.0=hdf63c60_3\n  - libgfortran=3.0.0=1\n  - libgfortran-ng=7.2.0=hdf63c60_3\n  - libstdcxx-ng=7.2.0=hdf63c60_3\n  - ncurses=6.0=h9df7e31_2\n  - numpy=1.14.2=py27hdbf6ddf_0\n  - pip=9.0.1=py27_5\n  - python=2.7.14=h1571d57_30\n  - readline=7.0=ha6073c6_4\n  - scipy=1.0.0=py27hf5f0f52_0\n  - setuptools=38.5.1=py27_0\n  - sqlite=3.22.0=h1bed415_0\n  - tk=8.6.7=hc745277_3\n  - wheel=0.30.0=py27h2bc6bb2_1\n  - zlib=1.2.11=ha838bed_2\n  - pip:\n    - cffi==1.11.5\n    - functools32==3.2.3.post2\n    - jsonschema==2.6.0\n    - musdb==0.2.3\n    - museval==0.2.0\n    - pyaml==17.12.1\n    - pycparser==2.18\n    - simplejson==3.13.2\n    - soundfile==0.9.0\n    - stempeg==0.1.3\n    - tqdm==4.19.7\n"
  },
  {
    "path": "layers.py",
    "content": "# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018\n# Layers.py\n\nimport keras\n\n\nclass AddSingletonDepth(keras.layers.Layer):\n\n    def call(self, x, mask=None):\n        x = keras.backend.expand_dims(x, -1)  # add a dimension of the right\n\n        if keras.backend.ndim(x) == 4:\n            return keras.backend.permute_dimensions(x, (0, 3, 1, 2))\n        else:\n            return x\n\n    def compute_output_shape(self, input_shape):\n        if len(input_shape) == 3:\n            return input_shape[0], 1, input_shape[1], input_shape[2]\n        else:\n            return input_shape[0], input_shape[1], 1\n\n\nclass Subtract(keras.layers.Layer):\n\n    def __init__(self, **kwargs):\n        super(Subtract, self).__init__(**kwargs)\n\n    def call(self, x, mask=None):\n        return x[0] - x[1]\n\n    def compute_output_shape(self, input_shape):\n        return input_shape[0]\n\n\nclass Add(keras.layers.Layer):\n\n    def __init__(self, **kwargs):\n        super(Add, self).__init__(**kwargs)\n\n    def call(self, x, mask=None):\n        output = x[0]\n        for i in range(1, len(x)):\n            output += x[i]\n        return output\n\n    def compute_output_shape(self, input_shape):\n        return input_shape[0]\n\n\nclass Slice(keras.layers.Layer):\n\n    def __init__(self, selector, output_shape, **kwargs):\n        self.selector = selector\n        self.desired_output_shape = output_shape\n        super(Slice, self).__init__(**kwargs)\n\n    def call(self, x, mask=None):\n\n        selector = self.selector\n        if len(self.selector) == 2 and not type(self.selector[1]) is slice and not type(self.selector[1]) is int:\n            x = keras.backend.permute_dimensions(x, [0, 2, 1])\n            selector = (self.selector[1], self.selector[0])\n\n        y = x[selector]\n\n        if len(self.selector) == 2 and not type(self.selector[1]) is slice and not type(self.selector[1]) is int:\n            y = keras.backend.permute_dimensions(y, [0, 2, 1])\n\n        return y\n\n\n    def compute_output_shape(self, input_shape):\n\n        output_shape = (None,)\n        for i, dim_length in enumerate(self.desired_output_shape):\n            if dim_length == Ellipsis:\n                output_shape = output_shape + (input_shape[i+1],)\n            else:\n                output_shape = output_shape + (dim_length,)\n        return output_shape\n"
  },
  {
    "path": "main.py",
    "content": "# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018\n# Main.py\n\nimport sys\nimport logging\nimport optparse\nimport json\nimport os\nimport models\nimport datasets\nimport util\nimport separate\n\n\ndef set_system_settings():\n    sys.setrecursionlimit(50000)\n    logging.getLogger().setLevel(logging.INFO)\n\n\ndef get_command_line_arguments():\n    parser = optparse.OptionParser()\n    parser.set_defaults(config='sessions/multi-instrument/config.json')\n    parser.set_defaults(mode='training')\n    parser.set_defaults(target='multi-instrument')\n    parser.set_defaults(load_checkpoint=None)\n    parser.set_defaults(condition_value=0)\n    parser.set_defaults(batch_size=None)\n    parser.set_defaults(one_shot=False)\n    parser.set_defaults(mixture_input_path=None)\n    parser.set_defaults(print_model_summary=False)\n    parser.set_defaults(target_field_length=None)\n\n    parser.add_option('--mode', dest='mode')\n    parser.add_option('--target', dest='target')\n    parser.add_option('--print_model_summary', dest='print_model_summary')\n    parser.add_option('--config', dest='config')\n    parser.add_option('--load_checkpoint', dest='load_checkpoint')\n    parser.add_option('--condition_value', dest='condition_value')\n    parser.add_option('--batch_size', dest='batch_size')\n    parser.add_option('--one_shot', dest='one_shot')\n    parser.add_option('--mixture_input_path', dest='mixture_input_path')\n    parser.add_option('--target_field_length', dest='target_field_length')\n\n    (options, args) = parser.parse_args()\n\n    return options\n\n\ndef load_config(config_filepath):\n    try:\n        config_file = open(config_filepath, 'r')\n    except IOError:\n        logging.error('No readable config file at path: ' + config_filepath)\n        exit()\n    else:\n        with config_file:\n            return json.load(config_file)\n\n\ndef get_dataset(config, cla, model):\n\n    if config['dataset']['type'] == 'musdb18':\n        if cla.target == 'singing-voice':\n            return datasets.SingingVoiceMUSDB18Dataset(config, model).load_dataset()\n        elif cla.target == 'multi-instrument':\n            return datasets.MultiInstrumentMUSDB18Dataset(config, model).load_dataset()\n\n\ndef training(config, cla):\n\n    # Instantiate Model\n\n    if cla.target == 'singing-voice':\n        model = models.SingingVoiceSeparationWavenet(config, load_checkpoint=cla.load_checkpoint,\n                                                     print_model_summary=cla.print_model_summary)\n    elif cla.target == 'multi-instrument':\n        model = models.MultiInstrumentSeparationWavenet(config, load_checkpoint=cla.load_checkpoint,\n                                                        print_model_summary=cla.print_model_summary)\n    else:\n        raise Exception(\"Argument target must be either 'singing-voice' or 'multi-instrument'\")\n\n    dataset = get_dataset(config, cla, model)\n    num_steps_train = config['training']['num_steps_train']\n    num_steps_val = config['training']['num_steps_test']\n    train_set_generator = dataset.get_random_batch_generator('train')\n    val_set_generator = dataset.get_random_batch_generator('val')\n\n    model.fit_model(train_set_generator, num_steps_train, val_set_generator, num_steps_val,\n                    config['training']['num_epochs'])\n\n\ndef get_valid_output_folder_path(outputs_folder_path):\n    j = 1\n    while True:\n        output_folder_name = 'samples_%d' % j\n        output_folder_path = 
os.path.join(outputs_folder_path, output_folder_name)\n        if not os.path.isdir(output_folder_path):\n            os.mkdir(output_folder_path)\n            break\n        j += 1\n    return output_folder_path\n\n\ndef inference(config, cla):\n\n    if cla.batch_size is not None:\n        batch_size = int(cla.batch_size)\n    else:\n        batch_size = config['training']['batch_size']\n\n    if cla.target_field_length is not None:\n        cla.target_field_length = int(cla.target_field_length)\n\n    if not bool(cla.one_shot):\n\n        if config['model']['type'] == 'singing-voice':\n            model = models.SingingVoiceSeparationWavenet(config, target_field_length=cla.target_field_length,\n                                                         load_checkpoint=cla.load_checkpoint,\n                                                         print_model_summary=cla.print_model_summary)\n\n        elif config['model']['type'] == 'multi-instrument':\n            model = models.MultiInstrumentSeparationWavenet(config, target_field_length=cla.target_field_length,\n                                                            load_checkpoint=cla.load_checkpoint,\n                                                            print_model_summary=cla.print_model_summary)\n\n        print 'Performing inference..'\n\n    else:\n        print 'Performing one-shot inference..'\n\n    samples_folder_path = os.path.join(config['training']['path'], 'samples')\n    output_folder_path = get_valid_output_folder_path(samples_folder_path)\n\n    #If input_path is a single wav file, then set filenames to single element with wav filename\n    if cla.mixture_input_path.endswith('.wav'):\n        filenames = [cla.mixture_input_path.rsplit('/', 1)[-1]]\n        cla.mixture_input_path = cla.mixture_input_path.rsplit('/', 1)[0] + '/'\n\n    else:\n        if not cla.mixture_input_path.endswith('/'):\n            cla.mixture_input_path += '/'\n        filenames = [filename for filename in os.listdir(cla.mixture_input_path) if filename.endswith('.wav')]\n\n    for filename in filenames:\n        mixture_input = util.load_wav(cla.mixture_input_path + filename, config['dataset']['sample_rate'])\n\n        input = {'mixture': mixture_input}\n\n        output_filename_prefix = filename[0:-4]\n\n        if bool(cla.one_shot):\n            if len(input['mixture']) % 2 == 0:  # If input length is even, remove one sample\n                input['mixture'] = input['mixture'][:-1]\n\n            if config['model']['type'] == 'singing-voice':\n                model = models.SingingVoiceSeparationWavenet(config, target_field_length=cla.target_field_length,\n                                                             load_checkpoint=cla.load_checkpoint,\n                                                             print_model_summary=cla.print_model_summary)\n\n            elif config['model']['type'] == 'multi-instrument':\n                model = models.MultiInstrumentSeparationWavenet(config, target_field_length=cla.target_field_length,\n                                                                load_checkpoint=cla.load_checkpoint,\n                                                                print_model_summary=cla.print_model_summary)\n\n        print \"Separating: \" + filename\n        separate.separate_sample(model, input, batch_size, output_filename_prefix,\n                                 config['dataset']['sample_rate'], output_folder_path, config['model']['type'])\n\n\ndef main():\n\n    
set_system_settings()\n    cla = get_command_line_arguments()\n    config = load_config(cla.config)\n\n    if cla.mode == 'training':\n        training(config, cla)\n    elif cla.mode == 'inference':\n        inference(config, cla)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },
  {
    "path": "models.py",
    "content": "# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018\n# Models.py\n\nimport keras\nimport util\nimport os\nimport numpy as np\nimport layers\nimport logging\n\n#Singing Voice Separation Wavenet Model\n\nclass SingingVoiceSeparationWavenet():\n\n    def __init__(self, config, load_checkpoint=None, input_length=None, target_field_length=None, print_model_summary=False):\n\n        self.config = config\n        self.verbosity = config['training']['verbosity']\n\n        self.num_stacks = self.config['model']['num_stacks']\n        if type(self.config['model']['dilations']) is int:\n            self.dilations = [2 ** i for i in range(0, self.config['model']['dilations'] + 1)]\n        elif type(self.config['model']['dilations']) is list:\n            self.dilations = self.config['model']['dilations']\n\n        self.receptive_field_length = util.compute_receptive_field_length(config['model']['num_stacks'], self.dilations,\n                                                                          config['model']['filters']['lengths']['res'],\n                                                                          1)\n\n        if input_length is not None:\n            self.input_length = input_length\n            self.target_field_length = self.input_length - (self.receptive_field_length - 1)\n        if target_field_length is not None:\n            self.target_field_length = target_field_length\n            self.input_length = self.receptive_field_length + (self.target_field_length - 1)\n        else:\n            self.target_field_length = config['model']['target_field_length']\n            self.input_length = self.receptive_field_length + (self.target_field_length - 1)\n\n        self.target_padding = config['model']['target_padding']\n        self.padded_target_field_length = self.target_field_length + 2 * self.target_padding\n        self.half_target_field_length = self.target_field_length / 2\n        self.half_receptive_field_length = self.receptive_field_length / 2\n        self.num_residual_blocks = len(self.dilations) * self.num_stacks\n        self.activation = keras.layers.Activation('relu')\n        self.samples_of_interest_indices = self.get_padded_target_field_indices()\n        self.target_sample_indices = self.get_target_field_indices()\n\n        self.optimizer = self.get_optimizer()\n        self.out_1_loss = self.get_out_1_loss()\n        self.out_2_loss = self.get_out_2_loss()\n        self.metrics = self.get_metrics()\n        self.epoch_num = 0\n        self.checkpoints_path = ''\n        self.samples_path = ''\n        self.history_filename = ''\n\n        self.config['model']['num_residual_blocks'] = self.num_residual_blocks\n        self.config['model']['receptive_field_length'] = self.receptive_field_length\n        self.config['model']['input_length'] = self.input_length\n        self.config['model']['target_field_length'] = self.target_field_length\n        self.config['model']['type'] = 'singing-voice'\n\n        self.model = self.setup_model(load_checkpoint, print_model_summary)\n\n    def setup_model(self, load_checkpoint=None, print_model_summary=False):\n\n        self.checkpoints_path = os.path.join(self.config['training']['path'], 'checkpoints')\n        self.samples_path = os.path.join(self.config['training']['path'], 'samples')\n        self.history_filename = 'history_' + self.config['training']['path'][\n                                             self.config['training']['path'].rindex('/') + 1:] + '.csv'\n\n        model = 
self.build_model()\n\n        if os.path.exists(self.checkpoints_path) and util.dir_contains_files(self.checkpoints_path):\n\n            if load_checkpoint is not None:\n                last_checkpoint_path = load_checkpoint\n                self.epoch_num = 0\n            else:\n                checkpoints = os.listdir(self.checkpoints_path)\n                checkpoints.sort(key=lambda x: os.stat(os.path.join(self.checkpoints_path, x)).st_mtime)\n                last_checkpoint = checkpoints[-1]\n                last_checkpoint_path = os.path.join(self.checkpoints_path, last_checkpoint)\n                self.epoch_num = int(last_checkpoint[11:16])\n            print 'Loading model from epoch: %d' % self.epoch_num\n            model.load_weights(last_checkpoint_path)\n\n        else:\n            print 'Building new model...'\n\n            if not os.path.exists(self.config['training']['path']):\n                os.mkdir(self.config['training']['path'])\n\n            if not os.path.exists(self.checkpoints_path):\n                os.mkdir(self.checkpoints_path)\n\n            self.epoch_num = 0\n\n        if not os.path.exists(self.samples_path):\n            os.mkdir(self.samples_path)\n\n        if print_model_summary:\n            model.summary()\n\n        model.compile(optimizer=self.optimizer,\n                      loss={'data_output_1': self.out_1_loss, 'data_output_2': self.out_2_loss}, metrics=self.metrics)\n        self.config['model']['num_params'] = model.count_params()\n\n        config_path = os.path.join(self.config['training']['path'], 'config.json')\n        if not os.path.exists(config_path):\n            util.pretty_json_dump(self.config, config_path)\n\n        if print_model_summary:\n            util.pretty_json_dump(self.config)\n        return model\n\n    def get_optimizer(self):\n\n        return keras.optimizers.Adam(lr=self.config['optimizer']['lr'], decay=self.config['optimizer']['decay'],\n                                     epsilon=self.config['optimizer']['epsilon'])\n\n    def get_out_1_loss(self):\n\n        if self.config['training']['loss']['out_1']['weight'] == 0:\n            return lambda y_true, y_pred: y_true * 0\n\n        return lambda y_true, y_pred: self.config['training']['loss']['out_1']['weight'] * util.l1_l2_loss(\n            y_true, y_pred, self.config['training']['loss']['out_1']['l1'],\n            self.config['training']['loss']['out_1']['l2'])\n\n    def get_out_2_loss(self):\n\n        if self.config['training']['loss']['out_2']['weight'] == 0:\n            return lambda y_true, y_pred: y_true * 0\n\n        return lambda y_true, y_pred: self.config['training']['loss']['out_2']['weight'] * util.l1_l2_loss(\n            y_true, y_pred, self.config['training']['loss']['out_2']['l1'],\n            self.config['training']['loss']['out_2']['l2'])\n\n    def get_callbacks(self):\n\n        return [\n            keras.callbacks.EarlyStopping(patience=self.config['training']['early_stopping_patience'], verbose=1,\n                                          monitor='loss'),\n            keras.callbacks.ModelCheckpoint(os.path.join(self.checkpoints_path,\n                                                         'checkpoint.{epoch:05d}-{val_loss:.3f}.hdf5')),\n            keras.callbacks.CSVLogger(os.path.join(self.config['training']['path'], self.history_filename), append=True)\n        ]\n\n    def fit_model(self, train_set_generator, num_steps_train, test_set_generator, num_steps_test, num_epochs):\n\n        print('Fitting model with %d 
training num steps and %d test num steps...' % (num_steps_train, num_steps_test))\n\n        self.model.fit_generator(train_set_generator,\n                                 num_steps_train,\n                                 epochs=num_epochs,\n                                 validation_data=test_set_generator,\n                                 validation_steps=num_steps_test,\n                                 callbacks=self.get_callbacks(),\n                                 verbose=self.verbosity,\n                                 initial_epoch=self.epoch_num)\n\n    def separate_batch(self, inputs):\n        return self.model.predict_on_batch(inputs)\n\n    def get_target_field_indices(self):\n\n        target_sample_index = self.get_target_sample_index()\n\n        return range(target_sample_index - self.half_target_field_length,\n                     target_sample_index + self.half_target_field_length + 1)\n\n    def get_padded_target_field_indices(self):\n\n        target_sample_index = self.get_target_sample_index()\n\n        return range(target_sample_index - self.half_target_field_length - self.target_padding,\n                     target_sample_index + self.half_target_field_length + self.target_padding + 1)\n\n    def get_target_sample_index(self):\n        return int(np.floor(self.input_length / 2.0))\n\n    def get_metrics(self):\n\n        return [\n            keras.metrics.mean_absolute_error,\n            self.valid_mean_absolute_error\n        ]\n\n    def valid_mean_absolute_error(self, y_true, y_pred):\n        return keras.backend.mean(\n            keras.backend.abs(y_true[:, 1:-2] - y_pred[:, 1:-2]))\n\n    def build_model(self):\n\n        data_input = keras.engine.Input(\n                shape=(self.input_length,),\n                name='data_input')\n\n        data_expanded = layers.AddSingletonDepth()(data_input)\n\n        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['res'],\n                                              self.config['model']['filters']['lengths']['res'], padding='same',\n                                              use_bias=False,\n                                              name='initial_causal_conv')(data_expanded)\n\n        skip_connections = []\n        res_block_i = 0\n        for stack_i in range(self.num_stacks):\n            layer_in_stack = 0\n            for dilation in self.dilations:\n                res_block_i += 1\n                data_out, skip_out = self.dilated_residual_block(data_out, res_block_i, layer_in_stack, dilation, stack_i)\n                if skip_out is not None:\n                    skip_connections.append(skip_out)\n                layer_in_stack += 1\n\n        data_out = keras.layers.Add()(skip_connections)\n\n        data_out = self.activation(data_out)\n\n        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][0],\n                                              self.config['model']['filters']['lengths']['final'][0],\n                                              padding='same',\n                                              use_bias=False)(data_out)\n\n        data_out = self.activation(data_out)\n        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][1],\n                                              self.config['model']['filters']['lengths']['final'][1], padding='same',\n                                              use_bias=False)(data_out)\n\n        data_out = 
keras.layers.Convolution1D(1, 1)(data_out)
\n\n        data_out_vocals_1 = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),\n                                              output_shape=lambda shape: (shape[0], shape[1]), name='data_output_1')(\n            data_out)
\n\n        data_out_vocals_2 = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),\n                                                output_shape=lambda shape: (shape[0], shape[1]), name='data_output_2')(\n            data_out)
\n\n        return keras.engine.Model(inputs=[data_input], outputs=[data_out_vocals_1, data_out_vocals_2])
\n\n    def dilated_residual_block(self, data_x, res_block_i, layer_i, dilation, stack_i):\n\n        original_x = data_x
\n\n        data_out = keras.layers.Conv1D(2 * self.config['model']['filters']['depths']['res'],\n                                                    self.config['model']['filters']['lengths']['res'],\n                                                    dilation_rate=dilation, padding='same',\n                                                    use_bias=False,\n                                                    name='res_%d_dilated_conv_d%d_s%d' % (\n                                                    res_block_i, dilation, stack_i),\n                                                    activation=None)(data_x)
\n\n        data_out_1 = layers.Slice(\n            (Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),\n            (self.input_length, self.config['model']['filters']['depths']['res']),\n            name='res_%d_data_slice_1_d%d_s%d' % (res_block_i, dilation, stack_i))(data_out)
\n\n        data_out_2 = layers.Slice(\n            (Ellipsis, slice(self.config['model']['filters']['depths']['res'],\n                             2 * self.config['model']['filters']['depths']['res'])),\n            (self.input_length, self.config['model']['filters']['depths']['res']),\n            name='res_%d_data_slice_2_d%d_s%d' % (res_block_i, dilation, stack_i))(data_out)
\n\n        tanh_out = keras.layers.Activation('tanh')(data_out_1)\n        sigm_out = keras.layers.Activation('sigmoid')(data_out_2)
\n\n        data_x = keras.layers.Multiply(name='res_%d_gated_activation_%d_s%d' % (res_block_i, layer_i, stack_i))(\n            [tanh_out, sigm_out])
\n\n        data_x = keras.layers.Convolution1D(\n            self.config['model']['filters']['depths']['res'] + self.config['model']['filters']['depths']['skip'], 1,\n            padding='same', use_bias=False)(data_x)
\n\n        res_x = layers.Slice((Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),\n                             (self.input_length, self.config['model']['filters']['depths']['res']),\n                             name='res_%d_data_slice_3_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)
\n\n        skip_x = layers.Slice((Ellipsis, slice(self.config['model']['filters']['depths']['res'],\n                                               self.config['model']['filters']['depths']['res'] +\n                                               self.config['model']['filters']['depths']['skip'])),\n                              (self.input_length, self.config['model']['filters']['depths']['skip']),\n                              name='res_%d_data_slice_4_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)
\n\n        skip_x = layers.Slice((slice(self.samples_of_interest_indices[0], self.samples_of_interest_indices[-1] + 1, 1),\n                               Ellipsis), (self.padded_target_field_length, self.config['model']['filters']['depths']['skip']),\n                              name='res_%d_keep_samples_of_interest_d%d_s%d' % (res_block_i, dilation, stack_i))(skip_x)
\n\n        res_x = keras.layers.Add()([original_x, res_x])\n\n        return res_x, skip_x
\n\n\n# Multi-Instrument Separation Wavenet Model\n\nclass MultiInstrumentSeparationWavenet():
\n\n    def __init__(self, config, load_checkpoint=None, input_length=None, target_field_length=None, print_model_summary=False):\n\n        self.config = config\n        self.verbosity = config['training']['verbosity']
\n\n        self.num_stacks = self.config['model']['num_stacks']\n        if type(self.config['model']['dilations']) is int:\n            self.dilations = [2 ** i for i in range(0, self.config['model']['dilations'] + 1)]\n        elif type(self.config['model']['dilations']) is list:\n            self.dilations = self.config['model']['dilations']
\n\n        self.receptive_field_length = util.compute_receptive_field_length(config['model']['num_stacks'], self.dilations,\n                                                                          config['model']['filters']['lengths']['res'],\n                                                                          1)
\n\n        # An explicit input_length takes precedence over target_field_length, which takes precedence over the config value.\n        if input_length is not None:\n            self.input_length = input_length\n            self.target_field_length = self.input_length - (self.receptive_field_length - 1)\n        elif target_field_length is not None:\n            self.target_field_length = target_field_length\n            self.input_length = self.receptive_field_length + (self.target_field_length - 1)\n        else:\n            self.target_field_length = config['model']['target_field_length']\n            self.input_length = self.receptive_field_length + (self.target_field_length - 1)
\n\n        self.target_padding = config['model']['target_padding']\n        self.padded_target_field_length = self.target_field_length + 2 * self.target_padding\n        # Integer halves: these are used as array indices later on.\n        self.half_target_field_length = self.target_field_length // 2\n        self.half_receptive_field_length = self.receptive_field_length // 2\n        self.num_residual_blocks = len(self.dilations) * self.num_stacks\n        self.activation = keras.layers.Activation('relu')\n        self.samples_of_interest_indices = self.get_padded_target_field_indices()\n        self.target_sample_indices = self.get_target_field_indices()
\n\n        self.optimizer = self.get_optimizer()\n        self.out_1_loss = self.get_out_1_loss()\n        self.out_2_loss = self.get_out_2_loss()\n        self.out_3_loss = self.get_out_3_loss()\n        self.metrics = self.get_metrics()\n        self.epoch_num = 0\n        self.checkpoints_path = ''\n        self.samples_path = ''\n        self.history_filename = ''
\n\n        self.config['model']['num_residual_blocks'] = self.num_residual_blocks\n        self.config['model']['receptive_field_length'] = self.receptive_field_length\n        self.config['model']['input_length'] = self.input_length\n        self.config['model']['target_field_length'] = self.target_field_length\n        self.config['model']['type'] = 'multi-instrument'\n\n        self.model = self.setup_model(load_checkpoint, print_model_summary)
\n\n    def setup_model(self, load_checkpoint=None, print_model_summary=False):\n\n        self.checkpoints_path = os.path.join(self.config['training']['path'], 'checkpoints')\n        self.samples_path = os.path.join(self.config['training']['path'], 'samples')\n        self.history_filename = 'history_' + self.config['training']['path'][\n                                             self.config['training']['path'].rindex('/') + 1:] + '.csv'\n\n        model = self.build_model()
\n\n        if os.path.exists(self.checkpoints_path) and util.dir_contains_files(self.checkpoints_path):\n\n            if load_checkpoint is not None:\n                last_checkpoint_path = load_checkpoint\n                self.epoch_num = 0\n            else:\n                checkpoints = os.listdir(self.checkpoints_path)\n                checkpoints.sort(key=lambda x: os.stat(os.path.join(self.checkpoints_path, x)).st_mtime)\n                last_checkpoint = checkpoints[-1]\n                last_checkpoint_path = os.path.join(self.checkpoints_path, last_checkpoint)\n                self.epoch_num = int(last_checkpoint[11:16])\n            print('Loading model from epoch: %d' % self.epoch_num)\n            model.load_weights(last_checkpoint_path)
\n\n        else:\n            print('Building new model...')\n\n            if not os.path.exists(self.config['training']['path']):\n                os.mkdir(self.config['training']['path'])\n\n            if not os.path.exists(self.checkpoints_path):\n                os.mkdir(self.checkpoints_path)\n\n            self.epoch_num = 0\n\n        if not os.path.exists(self.samples_path):\n            os.mkdir(self.samples_path)\n\n        if print_model_summary:\n            model.summary()
\n\n        model.compile(optimizer=self.optimizer,\n                      loss={'data_output_1': self.out_1_loss, 'data_output_2': self.out_2_loss,\n                            'data_output_3': self.out_3_loss}, metrics=self.metrics)\n        self.config['model']['num_params'] = model.count_params()\n\n        config_path = os.path.join(self.config['training']['path'], 'config.json')\n        if not os.path.exists(config_path):\n            util.pretty_json_dump(self.config, config_path)\n\n        if print_model_summary:\n            util.pretty_json_dump(self.config)\n        return model
\n\n    def get_optimizer(self):\n\n        return keras.optimizers.Adam(lr=self.config['optimizer']['lr'], decay=self.config['optimizer']['decay'],\n                                     epsilon=self.config['optimizer']['epsilon'])
\n\n    def get_out_1_loss(self):\n\n        if self.config['training']['loss']['out_1']['weight'] == 0:\n            return lambda y_true, y_pred: y_true * 0\n\n        return lambda y_true, y_pred: self.config['training']['loss']['out_1']['weight'] * util.l1_l2_loss(\n            y_true, y_pred, self.config['training']['loss']['out_1']['l1'],\n            self.config['training']['loss']['out_1']['l2'])
\n\n    def get_out_2_loss(self):\n\n        if self.config['training']['loss']['out_2']['weight'] == 0:\n            return lambda y_true, y_pred: y_true * 0\n\n        return lambda y_true, y_pred: self.config['training']['loss']['out_2']['weight'] * util.l1_l2_loss(\n            y_true, y_pred, self.config['training']['loss']['out_2']['l1'],\n            self.config['training']['loss']['out_2']['l2'])
\n\n    def get_out_3_loss(self):\n        if self.config['training']['loss']['out_3']['weight'] == 0:\n            return lambda y_true, y_pred: y_true * 0\n\n        return lambda y_true, y_pred: self.config['training']['loss']['out_3']['weight'] * util.l1_l2_loss(\n            y_true, y_pred, self.config['training']['loss']['out_3']['l1'],\n            self.config['training']['loss']['out_3']['l2'])
\n\n    def get_callbacks(self):\n\n        return [\n            keras.callbacks.EarlyStopping(patience=self.config['training']['early_stopping_patience'], verbose=1,\n                                          monitor='loss'),\n            keras.callbacks.ModelCheckpoint(os.path.join(self.checkpoints_path,\n                                                         'checkpoint.{epoch:05d}-{val_loss:.3f}.hdf5')),\n            keras.callbacks.CSVLogger(os.path.join(self.config['training']['path'], self.history_filename), append=True)\n        ]
\n\n    def fit_model(self, train_set_generator, num_steps_train, test_set_generator, num_steps_test, num_epochs):\n\n        print('Fitting model with %d training num steps and %d test num steps...' % (num_steps_train, num_steps_test))\n\n        self.model.fit_generator(train_set_generator,\n                                 num_steps_train,\n                                 epochs=num_epochs,\n                                 validation_data=test_set_generator,\n                                 validation_steps=num_steps_test,\n                                 callbacks=self.get_callbacks(),\n                                 verbose=self.verbosity,\n                                 initial_epoch=self.epoch_num)
\n\n    def separate_batch(self, inputs):\n        return self.model.predict_on_batch(inputs)
\n\n    def get_target_field_indices(self):\n\n        target_sample_index = self.get_target_sample_index()\n\n        return range(target_sample_index - self.half_target_field_length,\n                     target_sample_index + self.half_target_field_length + 1)
\n\n    def get_padded_target_field_indices(self):\n\n        target_sample_index = self.get_target_sample_index()\n\n        return range(target_sample_index - self.half_target_field_length - self.target_padding,\n                     target_sample_index + self.half_target_field_length + self.target_padding + 1)
\n\n    def get_target_sample_index(self):\n        return int(np.floor(self.input_length / 2.0))
\n\n    def get_metrics(self):\n\n        return [\n            keras.metrics.mean_absolute_error,\n            self.valid_mean_absolute_error\n        ]
\n\n    def valid_mean_absolute_error(self, y_true, y_pred):\n        return keras.backend.mean(\n            keras.backend.abs(y_true[:, 1:-2] - y_pred[:, 1:-2]))
\n\n    def build_model(self):\n\n        data_input = keras.engine.Input(\n                shape=(self.input_length,),\n                name='data_input')\n\n        data_expanded = layers.AddSingletonDepth()(data_input)
\n\n        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['res'],\n                                              self.config['model']['filters']['lengths']['res'], padding='same',\n                                              use_bias=False,\n                                              name='initial_causal_conv')(data_expanded)
\n\n        skip_connections = []\n        res_block_i = 0\n        for stack_i in range(self.num_stacks):\n            layer_in_stack = 0\n            for dilation in self.dilations:\n                res_block_i += 1\n                data_out, skip_out = self.dilated_residual_block(data_out, res_block_i, layer_in_stack, dilation, stack_i)\n                if skip_out is not None:\n                    skip_connections.append(skip_out)\n                layer_in_stack += 1
\n\n        data_out = keras.layers.Add()(skip_connections)\n\n        data_out = self.activation(data_out)
\n\n        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][0],\n                                              self.config['model']['filters']['lengths']['final'][0],\n                                              padding='same',\n                                              use_bias=False)(data_out)
\n\n        data_out = self.activation(data_out)\n        data_out = keras.layers.Convolution1D(self.config['model']['filters']['depths']['final'][1],\n                                              self.config['model']['filters']['lengths']['final'][1], padding='same',\n                                              use_bias=False)(data_out)
\n\n        data_out = keras.layers.Convolution1D(3, 1)(data_out)
\n\n        data_out_vocals = layers.Slice((Ellipsis, slice(0, 1)), (self.padded_target_field_length, 1),\n                                       name='slice_data_output_1')(data_out)\n        data_out_drums = layers.Slice((Ellipsis, slice(1, 2)), (self.padded_target_field_length, 1),\n                                              name='slice_data_output_2')(data_out)\n        data_out_bass = layers.Slice((Ellipsis, slice(2, 3)), (self.padded_target_field_length, 1),\n                                      name='slice_data_output_3')(data_out)
\n\n        data_out_vocals = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),\n                                              output_shape=lambda shape: (shape[0], shape[1]), name='data_output_1')(\n            data_out_vocals)
\n\n        data_out_drums = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),\n                                              output_shape=lambda shape: (shape[0], shape[1]), name='data_output_2')(\n            data_out_drums)
\n\n        data_out_bass = keras.layers.Lambda(lambda x: keras.backend.squeeze(x, 2),\n                                              output_shape=lambda shape: (shape[0], shape[1]), name='data_output_3')(\n            data_out_bass)
\n\n        return keras.engine.Model(inputs=[data_input], outputs=[data_out_vocals, data_out_drums, data_out_bass])
\n\n    def dilated_residual_block(self, data_x, res_block_i, layer_i, dilation, stack_i):\n\n        original_x = data_x
\n\n        data_out = keras.layers.Conv1D(2 * self.config['model']['filters']['depths']['res'],\n                                                    self.config['model']['filters']['lengths']['res'],\n                                                    dilation_rate=dilation, padding='same',\n                                                    use_bias=False,\n                                                    name='res_%d_dilated_conv_d%d_s%d' % (\n                                                    res_block_i, dilation, stack_i),\n                                                    activation=None)(data_x)
\n\n        data_out_1 = layers.Slice(\n            (Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),\n            (self.input_length, self.config['model']['filters']['depths']['res']),\n            name='res_%d_data_slice_1_d%d_s%d' % (res_block_i, dilation, stack_i))(data_out)
\n\n        data_out_2 = layers.Slice(\n            (Ellipsis, slice(self.config['model']['filters']['depths']['res'],\n                             2 * self.config['model']['filters']['depths']['res'])),\n            (self.input_length, self.config['model']['filters']['depths']['res']),\n            name='res_%d_data_slice_2_d%d_s%d' % (res_block_i, dilation, stack_i))(data_out)
\n\n        tanh_out = keras.layers.Activation('tanh')(data_out_1)\n        sigm_out = keras.layers.Activation('sigmoid')(data_out_2)
\n\n        data_x = keras.layers.Multiply(name='res_%d_gated_activation_%d_s%d' % (res_block_i, layer_i, stack_i))(\n            [tanh_out, sigm_out])
\n\n        data_x = keras.layers.Convolution1D(\n            self.config['model']['filters']['depths']['res'] + self.config['model']['filters']['depths']['skip'], 1,\n            padding='same', use_bias=False)(data_x)
\n\n        res_x = layers.Slice((Ellipsis, slice(0, self.config['model']['filters']['depths']['res'])),\n                             (self.input_length, self.config['model']['filters']['depths']['res']),\n                             name='res_%d_data_slice_3_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)
\n\n        skip_x = layers.Slice((Ellipsis, slice(self.config['model']['filters']['depths']['res'],\n                                               self.config['model']['filters']['depths']['res'] +\n                                               self.config['model']['filters']['depths']['skip'])),\n                              (self.input_length, self.config['model']['filters']['depths']['skip']),\n                              name='res_%d_data_slice_4_d%d_s%d' % (res_block_i, dilation, stack_i))(data_x)
\n\n        skip_x = layers.Slice((slice(self.samples_of_interest_indices[0], self.samples_of_interest_indices[-1] + 1, 1),\n                               Ellipsis), (self.padded_target_field_length, self.config['model']['filters']['depths']['skip']),\n                              name='res_%d_keep_samples_of_interest_d%d_s%d' % (res_block_i, dilation, stack_i))(skip_x)
\n\n        res_x = keras.layers.Add()([original_x, res_x])\n\n        return res_x, skip_x\n"
  },
  {
    "path": "separate.py",
    "content": "# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018\n# Separate.py\n\nfrom __future__ import division\nimport os\nimport util\nimport tqdm\nimport numpy as np\n\n\ndef separate_sample(model, input, batch_size, output_filename_prefix, sample_rate, output_path, target):\n\n    if target == 'singing-voice':\n\n        if len(input['mixture']) < model.receptive_field_length:\n            raise ValueError('Input is not long enough to be used with this model.')\n\n        num_output_samples = input['mixture'].shape[0] - (model.receptive_field_length - 1)\n        num_fragments = int(np.ceil(num_output_samples / model.target_field_length))\n        num_batches = int(np.ceil(num_fragments / batch_size))\n\n        vocals_output = []\n        num_pad_values = 0\n        fragment_i = 0\n        for batch_i in tqdm.tqdm(range(0, num_batches)):\n\n            if batch_i == num_batches - 1:  # If its the last batch\n                batch_size = num_fragments - batch_i * batch_size\n\n            input_batch = np.zeros((batch_size, model.input_length))\n\n            # Assemble batch\n            for batch_fragment_i in range(0, batch_size):\n\n                if fragment_i + model.target_field_length > num_output_samples:\n                    remainder = input['mixture'][fragment_i:]\n                    current_fragment = np.zeros((model.input_length,))\n                    current_fragment[:remainder.shape[0]] = remainder\n                    num_pad_values = model.input_length - remainder.shape[0]\n                else:\n                    current_fragment = input['mixture'][fragment_i:fragment_i + model.input_length]\n\n                input_batch[batch_fragment_i, :] = current_fragment\n                fragment_i += model.target_field_length\n\n            separated_output_fragments = model.separate_batch({'data_input': input_batch})\n\n            if type(separated_output_fragments) is list:\n                vocals_output_fragment = separated_output_fragments[0]\n\n            vocals_output_fragment = vocals_output_fragment[:,\n                                     model.target_padding: model.target_padding + model.target_field_length]\n            vocals_output_fragment = vocals_output_fragment.flatten().tolist()\n\n            if type(separated_output_fragments) is float:\n                vocals_output_fragment = [vocals_output_fragment]\n\n            vocals_output = vocals_output + vocals_output_fragment\n\n        vocals_output = np.array(vocals_output)\n\n        if num_pad_values != 0:\n            vocals_output = vocals_output[:-num_pad_values]\n\n        mixture_valid_signal = input['mixture'][\n                               model.half_receptive_field_length:model.half_receptive_field_length + len(vocals_output)]\n\n        accompaniment_output = mixture_valid_signal - vocals_output\n\n        output_vocals_filename = output_filename_prefix + '_vocals.wav'\n        output_accompaniment_filename = output_filename_prefix + '_accompaniment.wav'\n\n        output_vocals_filepath = os.path.join(output_path, output_vocals_filename)\n        output_accompaniment_filepath = os.path.join(output_path, output_accompaniment_filename)\n\n        util.write_wav(vocals_output, output_vocals_filepath, sample_rate)\n        util.write_wav(accompaniment_output, output_accompaniment_filepath, sample_rate)\n\n    if target == 'multi-instrument':\n\n        if len(input['mixture']) < model.receptive_field_length:\n            raise ValueError('Input is not long enough to be used 
with this model.')\n\n        num_output_samples = input['mixture'].shape[0] - (model.receptive_field_length - 1)\n        num_fragments = int(np.ceil(num_output_samples / model.target_field_length))\n        num_batches = int(np.ceil(num_fragments / batch_size))\n\n        vocals_output = []\n        drums_output = []\n        bass_output = []\n\n        num_pad_values = 0\n        fragment_i = 0\n        for batch_i in tqdm.tqdm(range(0, num_batches)):\n\n            if batch_i == num_batches - 1:  # If its the last batch\n                batch_size = num_fragments - batch_i * batch_size\n\n            input_batch = np.zeros((batch_size, model.input_length))\n\n            # Assemble batch\n            for batch_fragment_i in range(0, batch_size):\n\n                if fragment_i + model.target_field_length > num_output_samples:\n                    remainder = input['mixture'][fragment_i:]\n                    current_fragment = np.zeros((model.input_length,))\n                    current_fragment[:remainder.shape[0]] = remainder\n                    num_pad_values = model.input_length - remainder.shape[0]\n                else:\n                    current_fragment = input['mixture'][fragment_i:fragment_i + model.input_length]\n\n                input_batch[batch_fragment_i, :] = current_fragment\n                fragment_i += model.target_field_length\n\n            separated_output_fragments = model.separate_batch({'data_input': input_batch})\n\n            if type(separated_output_fragments) is list:\n                vocals_output_fragment = separated_output_fragments[0]\n                drums_output_fragment = separated_output_fragments[1]\n                bass_output_fragment = separated_output_fragments[2]\n\n            vocals_output_fragment = vocals_output_fragment[:,\n                                     model.target_padding: model.target_padding + model.target_field_length]\n            vocals_output_fragment = vocals_output_fragment.flatten().tolist()\n\n            drums_output_fragment = drums_output_fragment[:,\n                                    model.target_padding: model.target_padding + model.target_field_length]\n            drums_output_fragment = drums_output_fragment.flatten().tolist()\n\n            bass_output_fragment = bass_output_fragment[:,\n                                   model.target_padding: model.target_padding + model.target_field_length]\n            bass_output_fragment = bass_output_fragment.flatten().tolist()\n\n            if type(separated_output_fragments) is float:\n                vocals_output_fragment = [vocals_output_fragment]\n            if type(drums_output_fragment) is float:\n                drums_output_fragment = [drums_output_fragment]\n            if type(bass_output_fragment) is float:\n                bass_output_fragment = [bass_output_fragment]\n\n            vocals_output = vocals_output + vocals_output_fragment\n            drums_output = drums_output + drums_output_fragment\n            bass_output = bass_output + bass_output_fragment\n\n        vocals_output = np.array(vocals_output)\n        drums_output = np.array(drums_output)\n        bass_output = np.array(bass_output)\n\n        if num_pad_values != 0:\n            vocals_output = vocals_output[:-num_pad_values]\n            drums_output = drums_output[:-num_pad_values]\n            bass_output = bass_output[:-num_pad_values]\n\n        mixture_valid_signal = input['mixture'][\n                               
model.half_receptive_field_length:model.half_receptive_field_length + len(vocals_output)]\n\n        other_output = mixture_valid_signal - vocals_output - drums_output - bass_output\n\n        output_vocals_filename = output_filename_prefix + '_vocals.wav'\n        output_drums_filename = output_filename_prefix + '_drums.wav'\n        output_bass_filename = output_filename_prefix + '_bass.wav'\n        output_other_filename = output_filename_prefix + '_other.wav'\n\n        output_vocals_filepath = os.path.join(output_path, output_vocals_filename)\n        output_drums_filepath = os.path.join(output_path, output_drums_filename)\n        output_bass_filepath = os.path.join(output_path, output_bass_filename)\n        output_other_filepath = os.path.join(output_path, output_other_filename)\n\n        util.write_wav(vocals_output, output_vocals_filepath, sample_rate)\n        util.write_wav(drums_output, output_drums_filepath, sample_rate)\n        util.write_wav(bass_output, output_bass_filepath, sample_rate)\n        util.write_wav(other_output, output_other_filepath, sample_rate)"
  },
  {
    "path": "sessions/multi-instrument/config.json",
    "content": "{\n    \"dataset\": {\n        \"extract_voice_percentage\": 0,\n        \"in_memory_percentage\": 1,\n        \"path\": \"MUS\",\n        \"sample_rate\": 16000,\n        \"type\": \"musdb18\"\n    },\n    \"model\": {\n        \"condition_encoding\": \"binary\",\n        \"dilations\": 9,\n        \"filters\": {\n            \"depths\": {\n                \"final\": [\n                    2048,\n                    256\n                ],\n                \"res\": 64,\n                \"skip\": 64\n            },\n            \"lengths\": {\n                \"final\": [\n                    3,\n                    3\n                ],\n                \"res\": 3,\n                \"skip\": 1\n            }\n        },\n        \"input_length\": 9785,\n        \"num_params\": 3277763,\n        \"num_residual_blocks\": 40,\n        \"num_stacks\": 4,\n        \"receptive_field_length\": 8185,\n        \"target_field_length\": 1601,\n        \"target_padding\": 1,\n        \"type\": \"multi-instrument\"\n    },\n    \"optimizer\": {\n        \"decay\": 0.0,\n        \"epsilon\": 1e-08,\n        \"lr\": 0.001,\n        \"momentum\": 0.9,\n        \"type\": \"adam\"\n    },\n    \"training\": {\n        \"batch_size\": 10,\n        \"early_stopping_patience\": 16,\n        \"loss\": {\n            \"out_1\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            },\n            \"out_2\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            },\n            \"out_3\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            }\n        },\n        \"num_epochs\": 250,\n        \"num_steps_test\": 500,\n        \"num_steps_train\": 2000,\n        \"path\": \"sessions/multi-instrument\",\n        \"verbosity\": 1\n    }\n}\n"
  },
  {
    "path": "sessions/singing-voice/config.json",
    "content": "{\n    \"dataset\": {\n        \"extract_voice_percentage\": 0.5,\n        \"in_memory_percentage\": 1,\n        \"path\": \"data/MUS\",\n        \"sample_rate\": 16000,\n        \"type\": \"musdb18\"\n    },\n    \"model\": {\n        \"condition_encoding\": \"binary\",\n        \"dilations\": 9,\n        \"filters\": {\n            \"depths\": {\n                \"final\": [\n                    2048,\n                    256\n                ],\n                \"res\": 64,\n                \"skip\": 64\n            },\n            \"lengths\": {\n                \"final\": [\n                    3,\n                    3\n                ],\n                \"res\": 3,\n                \"skip\": 1\n            }\n        },\n        \"input_length\": 9785,\n        \"num_params\": 3277249,\n        \"num_residual_blocks\": 40,\n        \"num_stacks\": 4,\n        \"receptive_field_length\": 8185,\n        \"target_field_length\": 1601,\n        \"target_padding\": 1,\n        \"type\": \"singing-voice\"\n    },\n    \"optimizer\": {\n        \"decay\": 0.0,\n        \"epsilon\": 1e-08,\n        \"lr\": 0.001,\n        \"momentum\": 0.9,\n        \"type\": \"adam\"\n    },\n    \"training\": {\n        \"batch_size\": 10,\n        \"early_stopping_patience\": 16,\n        \"loss\": {\n            \"out_1\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": 1\n            },\n            \"out_2\": {\n                \"l1\": 1,\n                \"l2\": 0,\n                \"weight\": -0.05\n            }\n        },\n        \"num_epochs\": 250,\n        \"num_steps_test\": 500,\n        \"num_steps_train\": 2000,\n        \"path\": \"sessions/singing-voice\",\n        \"verbosity\": 1\n    }\n}\n"
  },
  {
    "path": "util.py",
    "content": "# A Wavenet For Source Separation - Francesc Lluis - 25.10.2018\n# Util.py\n# Utility functions for dealing with audio signals and training a Source Separation Wavenet\n\nimport os\nimport numpy as np\nimport json\nimport warnings\nimport scipy.signal\nimport scipy.stats\nimport soundfile as sf\nimport keras\nimport glob\n\n\ndef l1_l2_loss(y_true, y_pred, l1_weight, l2_weight):\n\n    loss = 0\n\n    if l1_weight != 0:\n        loss += l1_weight*keras.losses.mean_absolute_error(y_true, y_pred)\n\n    if l2_weight != 0:\n        loss += l2_weight * keras.losses.mean_squared_error(y_true, y_pred)\n\n    return loss\n\n\ndef compute_receptive_field_length(stacks, dilations, filter_length, target_field_length):\n\n    half_filter_length = (filter_length-1)/2\n    length = 0\n    for d in dilations:\n        length += d*half_filter_length\n    length = 2*length\n    length = stacks * length\n    length += target_field_length\n    return length\n\n\ndef wav_to_float(x):\n\n    try:\n        max_value = np.iinfo(x.dtype).max\n        min_value = np.iinfo(x.dtype).min\n    except:\n        max_value = np.finfo(x.dtype).max\n        min_value = np.finfo(x.dtype).min\n    x = x.astype('float64', casting='safe')\n    x -= min_value\n    x /= ((max_value - min_value) / 2.)\n    x -= 1.\n    return x\n\n\ndef float_to_uint8(x):\n\n    x += 1.\n    x /= 2.\n    uint8_max_value = np.iinfo('uint8').max\n    x *= uint8_max_value\n    x = x.astype('uint8')\n    return x\n\n\ndef keras_float_to_uint8(x):\n\n    x += 1.\n    x /= 2.\n    uint8_max_value = 255\n    x *= uint8_max_value\n    return x\n\n\ndef linear_to_ulaw(x, u=255):\n\n    x = np.sign(x) * (np.log(1 + u * np.abs(x)) / np.log(1 + u))\n    return x\n\n\ndef keras_linear_to_ulaw(x, u=255.0):\n\n    x = keras.backend.sign(x) * (keras.backend.log(1 + u * keras.backend.abs(x)) / keras.backend.log(1 + u))\n    return x\n\n\ndef uint8_to_float(x):\n\n    max_value = np.iinfo('uint8').max\n    min_value = np.iinfo('uint8').min\n    x = x.astype('float32', casting='unsafe')\n    x -= min_value\n    x /= ((max_value - min_value) / 2.)\n    x -= 1.\n    return x\n\n\ndef keras_uint8_to_float(x):\n\n    max_value = 255\n    min_value = 0\n    x -= min_value\n    x /= ((max_value - min_value) / 2.)\n    x -= 1.\n    return x\n\n\ndef ulaw_to_linear(x, u=255.0):\n\n    y = np.sign(x) * (1 / float(u)) * (((1 + float(u)) ** np.abs(x)) - 1)\n    return y\n\n\ndef keras_ulaw_to_linear(x, u=255.0):\n\n    y = keras.backend.sign(x) * (1 / u) * (((1 + u) ** keras.backend.abs(x)) - 1)\n    return y\n\n\ndef one_hot_encode(x, num_values=256):\n\n    if isinstance(x, int):\n        x = np.array([x])\n    if isinstance(x, list):\n        x = np.array(x)\n    return np.eye(num_values, dtype='uint8')[x.astype('uint8')]\n\n\ndef one_hot_decode(x):\n\n    return np.argmax(x, axis=-1)\n\n\ndef preemphasis(signal, alpha=0.95):\n\n    return np.append(signal[0], signal[1:] - alpha * signal[:-1])\n\n\ndef binary_encode(x, max_value):\n\n    if isinstance(x, int):\n        x = np.array([x])\n    if isinstance(x, list):\n        x = np.array(x)\n    width = np.ceil(np.log2(max_value)).astype(int)\n    return (((x[:, None] & (1 << np.arange(width)))) > 0).astype(int)\n\n\ndef get_condition_input_encode_func(representation):\n\n        if representation == 'binary':\n            return binary_encode\n        else:\n            return one_hot_encode\n\n\ndef ensure_keys_in_dict(keys, dictionary):\n\n    if all (key in dictionary for key in keys):\n        return 
True\n    return False\n\n\ndef get_subdict_from_dict(keys, dictionary):\n\n    return dict((k, dictionary[k]) for k in keys if k in dictionary)\n\n\ndef pretty_json_dump(values, file_path=None):\n\n    if file_path is None:\n        print json.dumps(values, sort_keys=True, indent=4, separators=(',', ': '))\n    else:\n        json.dump(values, open(file_path, 'w'), sort_keys=True, indent=4, separators=(',', ': '))\n\n\ndef read_wav(filename):\n    # Reads in a wav audio file, averages both if stereo, converts the signal to float64 representation\n\n    audio_signal, sample_rate = sf.read(filename)\n\n    if audio_signal.ndim > 1:\n        audio_signal = (audio_signal[:, 0] + audio_signal[:, 1])/2.0\n\n    if audio_signal.dtype != 'float64':\n        audio_signal = wav_to_float(audio_signal)\n\n    return audio_signal, sample_rate\n\n\ndef load_wav(wav_path, desired_sample_rate):\n\n    sequence, sample_rate = read_wav(wav_path)\n    sequence = ensure_sample_rate(sequence, desired_sample_rate, sample_rate)\n    return sequence\n\n\ndef write_wav(x, filename, sample_rate):\n\n    if type(x) != np.ndarray:\n        x = np.array(x)\n\n    with warnings.catch_warnings():\n        warnings.simplefilter(\"error\")\n        sf.write(filename, x, sample_rate)\n\n\ndef ensure_sample_rate(x, desired_sample_rate, file_sample_rate):\n\n    if file_sample_rate != desired_sample_rate:\n        return scipy.signal.resample_poly(x, desired_sample_rate, file_sample_rate)\n    return x\n\n\ndef normalize(x):\n    max_peak = np.max(np.abs(x))\n    return x / max_peak\n\n\ndef get_sequence_with_singing_indices(full_sequence):\n\n    signal_magnitude = np.abs(full_sequence)\n\n    chunk_length = 800\n\n    chunks_energies = []\n    for i in xrange(0, len(signal_magnitude), chunk_length):\n        chunks_energies.append(np.mean(signal_magnitude[i:i + chunk_length]))\n\n    threshold = np.max(chunks_energies) * .1\n    chunks_energies = np.asarray(chunks_energies)\n    chunks_energies[np.where(chunks_energies < threshold)] = 0\n    onsets = np.zeros(len(chunks_energies))\n    onsets[np.nonzero(chunks_energies)] = 1\n    onsets = np.diff(onsets)\n\n    start_ind = np.squeeze(np.where(onsets == 1))\n    finish_ind = np.squeeze(np.where(onsets == -1))\n\n    if finish_ind[0] < start_ind[0]:\n        finish_ind = finish_ind[1:]\n\n    if start_ind[-1] > finish_ind[-1]:\n        start_ind = start_ind[:-1]\n\n    indices_inici_final = np.insert(finish_ind, np.arange(len(start_ind)), start_ind)\n\n    return np.squeeze((np.asarray(indices_inici_final) + 1) * chunk_length)\n\n\ndef get_indices_subsequence(indices):\n\n    start_indice = 2 * np.random.randint(0, np.ceil(len(indices) / 2))\n    vocals_indices = (indices[start_indice], indices[start_indice + 1])\n    accompaniment_indices = vocals_indices\n\n    return vocals_indices, accompaniment_indices\n\n\ndef contains_voice(fragment, sequence):\n\n    signal_fragment_magnitude = np.abs(fragment)\n    signal_sequence_magnitude = np.abs(sequence)\n\n    chunk_length = 800\n\n    chunks_fragment_energies = []\n    for i in xrange(0, len(signal_fragment_magnitude), chunk_length):\n        chunks_fragment_energies.append(np.mean(signal_fragment_magnitude[i:i + chunk_length]))\n\n    chunks_sequence_energies = []\n    for i in xrange(0, len(signal_sequence_magnitude), chunk_length):\n        chunks_sequence_energies.append(np.mean(signal_sequence_magnitude[i:i + chunk_length]))\n\n    threshold = np.max(chunks_sequence_energies) * .1\n    chunks_fragment_energies = 
np.asarray(chunks_fragment_energies)\n    chunks_fragment_energies[np.where(chunks_fragment_energies < threshold)] = 0\n\n    if np.count_nonzero(chunks_fragment_energies) > 0:\n        return True\n    else:\n        return False\n\n\ndef dir_contains_files(path):\n\n    for f in os.listdir(path):\n        if not f.startswith('.'):\n            return True\n    return False\n"
  }
]