Repository: eborboihuc/SoundNet-tensorflow Branch: master Commit: b603cd4584a9 Files: 12 Total size: 36.7 KB Directory structure: gitextract_78ltv8rb/ ├── .gitignore ├── LICENSE ├── README.md ├── cmp.py ├── demo.txt ├── extract_feat.py ├── h5convert.py ├── load_t7.py ├── main.py ├── model.py ├── ops.py └── util.py ================================================ FILE CONTENTS ================================================ ================================================ FILE: .gitignore ================================================ # Data data/* *.zip output models/* # checkpoint *logs *checkpoint # trash .dropbox # Created by https://www.gitignore.io/api/python,vim ### Python ### # Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] *$py.class # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. 
*.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *,cover .hypothesis/ # Translations *.mo *.pot # Django stuff: *.log # Sphinx documentation docs/_build/ # PyBuilder target/ ### Vim ### [._]*.s[a-w][a-z] [._]s[a-w][a-z] *.un~ Session.vim .netrwhist *~ ================================================ FILE: LICENSE ================================================ MIT License Copyright (c) 2018 Hou-Ning Hu Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ================================================ FILE: README.md ================================================ # SoundNet-tensorflow TensorFlow implementation of "SoundNet" that learns rich natural sound representations. Code for paper "[SoundNet: Learning Sound Representations from Unlabeled Video](https://arxiv.org/abs/1610.09001)" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. 
NIPS 2016

![from soundnet](https://camo.githubusercontent.com/0b88af5c13ba987a17dcf90cd58816cf8ef04554/687474703a2f2f70726f6a656374732e637361696c2e6d69742e6564752f736f756e646e65742f736f756e646e65742e6a7067)

# Prerequisites

- Linux
- NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
- Python 2.7 with numpy or Python 3.5
- [Tensorflow](https://www.tensorflow.org/) 1.0.0 (up to 1.3.0)
- librosa

# Getting Started

- Clone this repo:
```bash
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
```

- Pretrained Model

I provide pre-trained models ported from [soundnet](http://data.csail.mit.edu/soundnet/soundnet_models_public.zip). You can download the 8-layer model [here](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjR015M1RLZW45OEU). Please place it at `./models/sound8.npy`.

- Data

Prepare your input mp3 files and place them under `./data/`.

Generate an input list as a txt file, one path per line, and place it under `./`:
```txt
./data/0001.mp3
./data/0002.mp3
./data/0003.mp3
...
```

Follow the steps in [extract features](#feature-extraction).

- NOTE

If you find that [some audio with an offset value `start` in FFMPEG causes a tremendous difference between `torch audio` and `librosa`](#FAQs), please **convert it** with the following command:

```
sox {input.mp3} {output.mp3} trim 0
```

After this, the result might be much better.

# Demo

For the demo, follow these steps:

i) Download the converted npy file [demo.npy](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjcEtqQ3VIM1pvZ3c) and place it under `./data/`

ii) Extract features from the pretrained model, using the sound track loaded with torch `lua audio` (the sound track is equivalent to the torch version):

```bash
python extract_feat.py -m {start layer number} -x {end layer number} -s
```

Then you can compare the outputs with the torch ones.
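
The comparison mentioned above can be sketched roughly as follows. This is a minimal illustration in the spirit of `cmp.py` from this repo, not a replacement for it: the function name `compare_features` and the synthetic arrays are assumptions made for the example, and it assumes both arrays already share the same layout (`cmp.py` transposes the Torch array first).

```python
import numpy as np


def compare_features(tf_feat, th_feat, decimals=4):
    """Summarise the element-wise difference between two feature arrays."""
    tf_feat = np.asarray(tf_feat, dtype=np.float64)
    th_feat = np.asarray(th_feat, dtype=np.float64)

    # Compare only the leading rows (time steps) present in both arrays.
    size = min(tf_feat.shape[0], th_feat.shape[0])

    # Round first so tiny floating-point noise does not dominate the report.
    diff = np.round(tf_feat[:size], decimals) - np.round(th_feat[:size], decimals)

    return {
        'total_abs_diff': float(np.sum(np.abs(diff))),
        'max_diff': float(np.max(diff)),
        'min_diff': float(np.min(diff)),
    }


if __name__ == '__main__':
    # Synthetic stand-ins for a saved TensorFlow feature and its Torch counterpart.
    tf_feat = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
    th_feat = np.array([[1.0, 2.5], [2.0, 4.0]])
    print(compare_features(tf_feat, th_feat))
```

In practice the two inputs would be loaded with `np.load` from the saved `tf_fea*.npy` file and the Torch dump; a total difference near zero after rounding suggests the port matches for that layer.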
# Feature Extraction

## Minimum example

i) Download the input file [demo.mp3](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjTjVEWVI3dnBsTG8) and place it under `./data/`

ii) Prepare a file list in `txt` format (`demo.txt`) that includes the input mp3 file(s) and place it under `./`

```txt
./data/demo.mp3
```

iii) Extract features from the raw waveforms listed in `demo.txt` (make sure [demo.mp3](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjTjVEWVI3dnBsTG8) is under `./data/`):

```bash
python extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt
```

## More options

To extract multiple features from a pretrained model with a downloaded mp3 dataset:

```bash
python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract
```

e.g. extract layers 4 through 16 (the end layer given with `-x` is exclusive) and save them as `./sound_out/tf_fea%02d.npy`:

```bash
python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract
```

For more details, run:

```bash
python extract_feat.py -h
```

# Finetuning

To train from an existing model:

```bash
python main.py
```

# Training

To train from scratch:

```bash
python main.py -p train
```

To extract features:

```bash
python main.py -p extract -m {start layer number} -x {end layer number} -s
```

For more details, run:

```bash
python main.py -h
```

# TODOs

- [x] Change audio loader to soundnet format
- [x] Make it compatible with Python 3
- [ ] Batch Norm behaviour different from Torch
- [ ] Fix conv8 padding issue in training phase
- [ ] Change all `config` into `tf.app.flags`
- [ ] Change dummy distribution of scene and object to useful placeholder
- [ ] Add sound and feature loader from [Data](https://projects.csail.mit.edu/soundnet/) section

# Known issues

- Loaded audio length is not consistent between `torch7 audio` and `librosa`.
  Here is the [issue](https://github.com/soumith/lua---audio/issues/17#issuecomment-288648237)
- Training with short audio makes conv8 fail with [output size would be negative](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45)

# FAQs

- Why does my loaded sound wave differ between `torch7 audio` and `librosa`?
  Here is my [WiKi](https://github.com/eborboihuc/SoundNet-tensorflow/wiki/info.md)

# Acknowledgments

Code ported from [soundnet](https://github.com/cvondrick/soundnet). The Torch7-to-TensorFlow loader is from [tf_videogan](https://github.com/Yuliang-Zou/tf_videogan). Thanks for their excellent work!

## Author

Hou-Ning Hu / [@eborboihuc](https://eborboihuc.github.io/)


================================================
FILE: cmp.py
================================================
# Usage: python cmp.py {layer number} [decimals]
import numpy as np
import sys

name = sys.argv[1]
dec = int(sys.argv[2]) if len(sys.argv) >= 3 else 4

# Torch reference features (transposed to match the TensorFlow layout) and TensorFlow features
th = np.load('output/demo_th.npy', encoding='latin1').item()['layer{}'.format(name)].T
tf = np.load('output/tf_fea{}.npy'.format(str(name).zfill(2)), encoding='latin1')

# Layer 25 is compared together with layer 26 (the two conv8 heads)
if name == '25':
    tf = np.concatenate([tf, np.load('output/tf_fea26.npy', encoding='latin1')], 1)

print('Layer {}: tf.shape={}, th.shape={}'.format(name, tf.shape, th.shape))
print('TF:')
print(tf)
print('Torch:')
print(th)

# Compare only the rows present in both arrays
size = tf.shape[0] if tf.shape[0] < th.shape[0] else th.shape[0]

print('Round to {} decimals'.format(dec))
tf = np.round(tf, decimals=dec)
th = np.round(th, decimals=dec)

print('Total Diff: {} Max Diff: {} Min Diff: {}'.format(
    np.sum(abs(tf[:size] - th[:size])),
    np.max(tf[:size] - th[:size]),
    np.min(tf[:size] - th[:size])))


================================================
FILE: demo.txt
================================================
data/demo.mp3


================================================
FILE: extract_feat.py
================================================
# TensorFlow version of NIPS2016 soundnet
from util import load_from_txt
from model import Model import tensorflow as tf import numpy as np import argparse import sys import os # Make xrange compatible in both Python 2, 3 try: xrange except NameError: xrange = range local_config = { 'batch_size': 1, 'eps': 1e-5, 'sample_rate': 22050, 'load_size': 22050*20, 'name_scope': 'SoundNet', 'phase': 'extract', } def parse_args(): """ Parse input arguments """ parser = argparse.ArgumentParser(description='Extract Feature') parser.add_argument('-t', '--txt', dest='audio_txt', help='target audio txt path. e.g., [demo.txt]', default='demo.txt') parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output') parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [demo, extract]', default='demo') parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1) parser.add_argument('-x', dest='layer_max', help='end at which feature layer. e.g., [24]', type=int, default=None) parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0') feature_parser = parser.add_mutually_exclusive_group(required=False) feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. 
[False(default), True]', action='store_true') parser.set_defaults(is_save=False) args = parser.parse_args() return args def extract_feat(model, sound_input, config): layer_min = config.layer_min layer_max = config.layer_max if config.layer_max is not None else layer_min + 1 # Extract feature features = {} feed_dict = {model.sound_input_placeholder: sound_input} for idx in xrange(layer_min, layer_max): feature = model.sess.run(model.layers[idx], feed_dict=feed_dict) features[idx] = feature if config.is_save: np.save(os.path.join(config.outpath, 'tf_fea{}.npy'.format( \ str(idx).zfill(2))), np.squeeze(feature)) print("Save layer {} with shape {} as {}/tf_fea{}.npy".format( \ idx, np.squeeze(feature).shape, config.outpath, str(idx).zfill(2))) return features if __name__ == '__main__': args = parse_args() # Setup visible device os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device # Load pre-trained model G_name = './models/sound8.npy' param_G = np.load(G_name, encoding = 'latin1').item() if args.phase == 'demo': # Demo sound_samples = [np.reshape(np.load('data/demo.npy', encoding='latin1'), [1, -1, 1, 1])] else: # Extract Feature sound_samples = load_from_txt(args.audio_txt, config=local_config) # Make path if not os.path.exists(args.outpath): os.mkdir(args.outpath) # Init. 
Session
    sess_config = tf.ConfigProto()
    sess_config.allow_soft_placement=True
    sess_config.gpu_options.allow_growth = True

    with tf.Session(config=sess_config) as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)
        init = tf.global_variables_initializer()
        session.run(init)

        model.load()

        for sound_sample in sound_samples:
            output = extract_feat(model, sound_sample, args)


================================================
FILE: h5convert.py
================================================
# Usage: python h5convert.py {input hdf5 file} {output npy path}
import numpy as np
import h5py
import sys

th = h5py.File(sys.argv[1], 'r')

# Materialize the keys as a list so printing and indexing work on both Python 2 and 3
keys = list(th.keys())
print(keys)

if len(keys) <= 1:
    # Single dataset: save it as a plain array
    key = keys[0]
    npy = np.array(th[key])
else:
    # Several datasets: save a dict of arrays
    npy = {}
    for key in keys:
        npy[key] = np.array(th[key])

np.save(sys.argv[2], npy)


================================================
FILE: load_t7.py
================================================
# Load t7 files
# Required package: torchfile.
# $ pip install torchfile
import torchfile
import numpy as np
import pdb

# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

keys = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'conv6', 'conv7', 'conv8', 'conv8_2']

def load(o, param_list):
    """ Get torch7 weights into numpy array """
    try:
        num = len(o['modules'])
    except:
        num = 0

    for i in xrange(num):
        # 2D conv
        if o['modules'][i]._typename == 'nn.SpatialConvolution' or \
            o['modules'][i]._typename == 'cudnn.SpatialConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # 2D deconv
        elif o['modules'][i]._typename == 'nn.SpatialFullConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # 3D conv
        elif o['modules'][i]._typename == 'nn.VolumetricFullConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,4,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # batch norm
        elif
o['modules'][i]._typename == 'nn.SpatialBatchNormalization' or \ o['modules'][i]._typename == 'nn.VolumetricBatchNormalization': param_list[-1]['gamma'] = o['modules'][i]['weight'] param_list[-1]['beta'] = o['modules'][i]['bias'] param_list[-1]['mean'] = o['modules'][i]['running_mean'] param_list[-1]['var'] = o['modules'][i]['running_var'] load(o['modules'][i], param_list) def show(o): """ Show nn information """ nn = {} nn_keys = {} nn_info = {} num = len(o['modules']) if o['modules'] else 0 mylist = get_mylist() for i in xrange(num): # Get _obj and keys from torchfile nn[i] = o['modules'][i]._obj nn_keys[i] = o['modules'][i]._obj.keys() # Get information from _obj # {layer i: {mylist keys: value}} nn_info[i] = {key: nn[i][key] for key in sorted(nn_keys[i]) if key in mylist} nn_info[i]['name'] = o['modules'][i]._typename print(i, nn_info[i]['name']) for item in sorted(nn_info[i].keys()): print(" {}:{}".format(item, nn_info[i][item] if 'running' not in item \ else nn_info[i][item].shape)) def get_mylist(): """ Return manually selected information lists """ return ['_type', 'nInputPlane', 'nOutputPlane', \ 'input_offset', 'groups', 'dH', 'dW', \ 'padH', 'padW', 'kH', 'kW', 'iSize', \ 'running_mean', 'running_var'] if __name__ == '__main__': # File loader t7_file = './models/soundnet8_final.t7' o = torchfile.load(t7_file) # To show nn parameter show(o) # To store as npy file param_list = [] load(o, param_list) save_list = {} for i, k in enumerate(keys): save_list[k] = param_list[i] np.save('sound8', save_list) ================================================ FILE: main.py ================================================ # TensorFlow version of NIPS2016 soundnet # Required package: librosa: A python package for music and audio analysis. 
# $ pip install librosa from ops import batch_norm, conv2d, relu, maxpool from util import preprocess, load_from_list, load_audio from model import Model from glob import glob import tensorflow as tf import numpy as np import argparse import time import sys import os # Make xrange compatible in both Python 2, 3 try: xrange except NameError: xrange = range local_config = { 'batch_size': 1, 'train_size': np.inf, 'epoch': 200, 'eps': 1e-5, 'learning_rate': 1e-3, 'beta1': 0.9, 'load_size': 22050*4, 'sample_rate': 22050, 'name_scope': 'SoundNet', 'phase': 'train', 'dataset_name': 'ESC50', 'subname': 'mp3', 'checkpoint_dir': 'checkpoint', 'dump_dir': 'output', 'model_dir': None, 'param_g_dir': './models/sound8.npy', } class Model(): def __init__(self, session, config=local_config, param_G=None): self.sess = session self.config = config self.param_G = param_G self.g_step = tf.Variable(0, trainable=False) self.counter = 0 self.model() def model(self): # Placeholder self.sound_input_placeholder = tf.placeholder(tf.float32, shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel self.object_dist = tf.placeholder(tf.float32, shape=[self.config['batch_size'], None, 1000]) # batch x h x w x channel self.scene_dist = tf.placeholder(tf.float32, shape=[self.config['batch_size'], None, 401]) # batch x h x w x channel # Generator self.add_generator(name_scope=self.config['name_scope']) # KL Divergence self.object_loss = self.KL_divergence(self.layers[25], self.object_dist, name_scope='KL_Div_object') self.scene_loss = self.KL_divergence(self.layers[26], self.scene_dist, name_scope='KL_Div_scene') self.loss = self.object_loss + self.scene_loss # Summary self.loss_sum = tf.summary.scalar("g_loss", self.loss) self.g_sum = tf.summary.merge([self.loss_sum]) self.writer = tf.summary.FileWriter("./logs", self.sess.graph) # variable collection self.g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope=self.config['name_scope']) self.saver = 
tf.train.Saver(keep_checkpoint_every_n_hours=12, max_to_keep=5, restore_sequentially=True) # Optimizer and summary self.g_optim = tf.train.AdamOptimizer(self.config['learning_rate'], beta1=self.config['beta1']) \ .minimize(self.loss, var_list=(self.g_vars), global_step=self.g_step) # Initialize init_op = tf.global_variables_initializer() self.sess.run(init_op) # Load checkpoint if self.load(self.config['checkpoint_dir']): print(" [*] Load SUCCESS") else: print(" [!] Load failed...") def add_generator(self, name_scope='SoundNet'): with tf.variable_scope(name_scope) as scope: self.layers = {} # Stream one: conv1 ~ conv7 self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1') self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1') self.layers[3] = relu(self.layers[2], name_scope='conv1') self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1') self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2') self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2') self.layers[7] = relu(self.layers[6], name_scope='conv2') self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2') self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3') self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3') self.layers[11] = relu(self.layers[10], name_scope='conv3') self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4') self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4') self.layers[14] = relu(self.layers[13], name_scope='conv4') self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5') self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5') self.layers[17] = relu(self.layers[16], name_scope='conv5') 
self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5') self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6') self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6') self.layers[21] = relu(self.layers[20], name_scope='conv6') self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7') self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7') self.layers[24] = relu(self.layers[23], name_scope='conv7') # Split one: conv8, conv8_2 # NOTE: here we use a padding of 2 to skip an unknown error # https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45 self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, p_h=2, name_scope='conv8') self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, p_h=2, name_scope='conv8_2') def train(self): """Train SoundNet""" start_time = time.time() # Data info data = glob('./data/*.{}'.format(self.config['subname'])) batch_idxs = min(len(data), self.config['train_size']) // self.config['batch_size'] for epoch in xrange(self.counter//batch_idxs, self.config['epoch']): for idx in xrange(self.counter%batch_idxs, batch_idxs): # By default, librosa will resample the signal to 22050Hz. And range in (-1., 1.) 
sound_sample = load_from_list(data[idx*self.config['batch_size']:(idx+1)*self.config['batch_size']], self.config) # Update G network # NOTE: Here we still use dummy random distribution for scene and objects _, summary_str, l_scn, l_obj = self.sess.run([self.g_optim, self.g_sum, self.scene_loss, self.object_loss], feed_dict={self.sound_input_placeholder: sound_sample, \ self.scene_dist: np.random.randint(2, size=(1, 1, 401)), \ self.object_dist: np.random.randint(2, size=(1, 1, 1000))}) self.writer.add_summary(summary_str, self.counter) print ("[Epoch {}] {}/{} | Time: {} | scene_loss: {} | obj_loss: {}".format(epoch, idx, batch_idxs, time.time() - start_time, l_scn, l_obj)) if np.mod(self.counter, 1000) == 1000 - 1: self.save(self.config['checkpoint_dir'], self.counter) self.counter += 1 ######################### # Loss # ######################### # Adapt the answer here: http://stackoverflow.com/questions/41863814/kl-divergence-in-tensorflow def KL_divergence(self, dist_a, dist_b, name_scope='KL_Div'): return tf.reduce_mean(-tf.nn.softmax_cross_entropy_with_logits(logits=dist_a, labels=dist_b)) ######################### # Save/Load # ######################### @property def get_model_dir(self): if self.config['model_dir'] is None: return "{}_{}".format( self.config['dataset_name'], self.config['batch_size']) else: return self.config['model_dir'] def load(self, ckpt_dir='checkpoint'): return self.load_from_ckpt(ckpt_dir) if self.param_G is None \ else self.load_from_npy() def save(self, checkpoint_dir, step): """ Checkpoint saver """ model_name = "SoundNet.model" checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir) if not os.path.exists(checkpoint_dir): os.makedirs(checkpoint_dir) self.saver.save(self.sess, os.path.join(checkpoint_dir, model_name), global_step=step) def load_from_ckpt(self, checkpoint_dir='checkpoint'): """ Checkpoint loader """ print(" [*] Reading checkpoints...") checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir) ckpt = 
tf.train.get_checkpoint_state(checkpoint_dir) if ckpt and ckpt.model_checkpoint_path: ckpt_name = os.path.basename(ckpt.model_checkpoint_path) self.saver.restore(self.sess, os.path.join(checkpoint_dir, ckpt_name)) print(" [*] Success to read {}".format(ckpt_name)) self.counter = int(ckpt_name.rsplit('-', 1)[-1]) print(" [*] Start counter from {}".format(self.counter)) return True else: print(" [*] Failed to find a checkpoint under {}".format(checkpoint_dir)) return False def load_from_npy(self): if self.param_G is None: return False data_dict = self.param_G for key in data_dict: with tf.variable_scope(self.config['name_scope'] + '/'+ key, reuse=True): for subkey in data_dict[key]: try: var = tf.get_variable(subkey) self.sess.run(var.assign(data_dict[key][subkey])) print('Assign pretrain model {} to {}'.format(subkey, key)) except: print('Ignore {}'.format(key)) self.param_G.clear() return True def main(): args = parse_args() local_config['phase'] = args.phase # Setup visible device os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device # Make path if not os.path.exists(args.outpath): os.mkdir(args.outpath) # Load pre-trained model param_G = np.load(local_config['param_g_dir'], encoding='latin1').item() \ if args.phase in ['finetune', 'extract'] \ else None # Init. 
Session sess_config = tf.ConfigProto() sess_config.allow_soft_placement=True sess_config.gpu_options.allow_growth = True with tf.Session(config=sess_config) as session: # Build model model = Model(session, config=local_config, param_G=param_G) if args.phase in ['train', 'finetune']: # Training phase model.train() elif args.phase == 'extract': # import when we need from extract_feat import extract_feat # Feature extractor #sound_sample = np.reshape(np.load('./data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1]) import librosa audio_path = './data/demo.mp3' sound_sample, _ = load_audio(audio_path) sound_sample = preprocess(sound_sample, config=local_config) output = extract_feat(model, sound_sample, args) def parse_args(): """ Parse input arguments """ parser = argparse.ArgumentParser(description='SoundNet') parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output') parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [train, finetune, extract]', default='finetune') parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1) parser.add_argument('-x', dest='layer_max', help='end at which feature layer. e.g., [24]', type=int, default=None) parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0') feature_parser = parser.add_mutually_exclusive_group(required=False) feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. 
[False(default), True]', action='store_true') parser.set_defaults(is_save=False) args = parser.parse_args() return args if __name__ == '__main__': main() ================================================ FILE: model.py ================================================ # TensorFlow version of NIPS2016 soundnet import sys import numpy as np import tensorflow as tf from ops import batch_norm, conv2d, relu, maxpool # Make xrange compatible in both Python 2, 3 try: xrange except NameError: xrange = range local_config = { 'batch_size': 1, 'eps': 1e-5, 'name_scope': 'SoundNet', } class Model(): def __init__(self, session, config=local_config, param_G=None): # Print config for key in config: print("{}:{}".format(key, config[key])) self.sess = session self.config = config self.param_G = param_G # Placeholder self.add_placeholders() # Generator self.add_generator(name_scope=self.config['name_scope']) def add_placeholders(self): self.sound_input_placeholder = tf.placeholder(tf.float32, shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel def add_generator(self, name_scope='SoundNet'): with tf.variable_scope(name_scope) as scope: self.layers = {} # Stream one: conv1 ~ conv7 self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1') self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1') self.layers[3] = relu(self.layers[2], name_scope='conv1') self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1') self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2') self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2') self.layers[7] = relu(self.layers[6], name_scope='conv2') self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2') self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3') self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], 
name_scope='conv3') self.layers[11] = relu(self.layers[10], name_scope='conv3') self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4') self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4') self.layers[14] = relu(self.layers[13], name_scope='conv4') self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5') self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5') self.layers[17] = relu(self.layers[16], name_scope='conv5') self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5') self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6') self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6') self.layers[21] = relu(self.layers[20], name_scope='conv6') self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7') self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7') self.layers[24] = relu(self.layers[23], name_scope='conv7') # Split one: conv8, conv8_2 self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, name_scope='conv8') self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, name_scope='conv8_2') def load(self): if self.param_G is None: return False data_dict = self.param_G for key in data_dict: with tf.variable_scope(self.config['name_scope'] + '/' + key, reuse=True): for subkey in data_dict[key]: try: var = tf.get_variable(subkey) self.sess.run(var.assign(data_dict[key][subkey])) print('Assign pretrain model {} to {}'.format(subkey, key)) except: print('Ignore {}'.format(key)) self.param_G.clear() return True if __name__ == '__main__': layer_min = int(sys.argv[1]) layer_max = int(sys.argv[2]) if len(sys.argv) > 2 else layer_min + 1 # Load pre-trained model G_name = './models/sound8.npy' param_G = np.load(G_name, encoding='latin1').item() 
dump_path = './output/' with tf.Session() as session: # Build model model = Model(session, config=local_config, param_G=param_G) init = tf.global_variables_initializer() session.run(init) model.load() # Demo sound_input = np.reshape(np.load('data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1]) feed_dict = {model.sound_input_placeholder: sound_input} # Forward for idx in xrange(layer_min, layer_max): feature = session.run(model.layers[idx], feed_dict=feed_dict) np.save(dump_path + 'tf_fea{}.npy'.format(str(idx).zfill(2)), np.squeeze(feature)) print("Save layer {} with shape {} as {}tf_fea{}.npy".format(idx, np.squeeze(feature).shape, dump_path, str(idx).zfill(2))) ================================================ FILE: ops.py ================================================ # TensorFlow version of NIPS2016 soundnet import tensorflow as tf def conv2d(prev_layer, in_ch, out_ch, k_h=1, k_w=1, d_h=1, d_w=1, p_h=0, p_w=0, pad='VALID', name_scope='conv'): with tf.variable_scope(name_scope) as scope: # h x w x input_channel x output_channel w_conv = tf.get_variable('weights', [k_h, k_w, in_ch, out_ch], initializer=tf.truncated_normal_initializer(0.0, stddev=0.01)) b_conv = tf.get_variable('biases', [out_ch], initializer=tf.constant_initializer(0.0)) padded_input = tf.pad(prev_layer, [[0, 0], [p_h, p_h], [p_w, p_w], [0, 0]], "CONSTANT") if pad == 'VALID' \ else prev_layer output = tf.nn.conv2d(padded_input, w_conv, [1, d_h, d_w, 1], padding=pad, name='z') + b_conv return output def batch_norm(prev_layer, out_ch, eps, name_scope='conv'): with tf.variable_scope(name_scope) as scope: #mu_conv, var_conv = tf.nn.moments(prev_layer, [0, 1, 2], keep_dims=False) mu_conv = tf.get_variable('mean', [out_ch], initializer=tf.constant_initializer(0)) var_conv = tf.get_variable('var', [out_ch], initializer=tf.constant_initializer(1)) gamma_conv = tf.get_variable('gamma', [out_ch], initializer=tf.constant_initializer(1)) beta_conv = tf.get_variable('beta', [out_ch], 
initializer=tf.constant_initializer(0))
        output = tf.nn.batch_normalization(prev_layer, mu_conv, var_conv, beta_conv, gamma_conv, eps, name='batch_norm')
        return output


def relu(prev_layer, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        return tf.nn.relu(prev_layer, name='a')


def maxpool(prev_layer, k_h=1, k_w=1, d_h=1, d_w=1, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        return tf.nn.max_pool(prev_layer, [1, k_h, k_w, 1], [1, d_h, d_w, 1], padding='VALID', name='maxpool')


================================================
FILE: util.py
================================================
import numpy as np
import librosa
import pdb

local_config = {
    'batch_size': 64,
    'load_size': 22050*20,
    'phase': 'extract'
}

def load_from_list(name_list, config=local_config):
    assert len(name_list) == config['batch_size'], \
        "The length of name_list({})[{}] is not the same as batch_size[{}]".format(
            name_list[0], len(name_list), config['batch_size'])
    audios = np.zeros([config['batch_size'], config['load_size'], 1, 1])
    for idx, audio_path in enumerate(name_list):
        sound_sample, _ = load_audio(audio_path)
        audios[idx] = preprocess(sound_sample, config)

    return audios


def load_from_txt(txt_name, config=local_config):
    with open(txt_name, 'r') as handle:
        txt_list = handle.read().splitlines()

    audios = []
    for idx, audio_path in enumerate(txt_list):
        sound_sample, _ = load_audio(audio_path)
        audios.append(preprocess(sound_sample, config))

    return audios


# NOTE: Load an audio as the same format in soundnet
# 1. Keep original sample rate (which conflicts their own paper)
# 2. Use first channel in multiple channels
# 3. Keep range in [-256, 256]

def load_audio(audio_path, sr=None):
    # librosa resamples to 22050 Hz by default; passing sr=None keeps the file's native sample rate. Samples are in the range (-1., 1.)
    sound_sample, sr = librosa.load(audio_path, sr=sr, mono=False)

    return sound_sample, sr


def preprocess(raw_audio, config=local_config):
    # Select first channel (mono)
    if len(raw_audio.shape) > 1:
        raw_audio = raw_audio[0]

    # Make range [-256, 256]
    raw_audio *= 256.0

    # Make minimum length available
    length = config['load_size']
    if length > raw_audio.shape[0]:
        # Integer division: np.tile needs an int repeat count (plain `/` yields a float on Python 3)
        raw_audio = np.tile(raw_audio, length // raw_audio.shape[0] + 1)

    # Make equal training length
    if config['phase'] != 'extract':
        raw_audio = raw_audio[:length]

    # Check conditions
    assert len(raw_audio.shape) == 1, "It seems this audio contains two channels, we only need the first channel"
    assert np.max(raw_audio) <= 256, "It seems this audio contains signal that exceeds 256"
    assert np.min(raw_audio) >= -256, "It seems this audio contains signal that falls below -256"

    # Shape to 1 x DIM x 1 x 1
    raw_audio = np.reshape(raw_audio, [1, -1, 1, 1])

    return raw_audio.copy()
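

# ----------------------------------------------------------------------
# Illustrative sketch, not part of the original util.py.
# `load_size` above controls how many samples reach the network, and the
# README lists a conv8 "output size would be negative" failure for short
# audio. The helpers below estimate, with plain integer arithmetic, how many
# time steps survive the layer stack. The names conv_out_len,
# soundnet_out_len and SOUNDNET_STACK are hypothetical; the kernel, stride
# and padding values are copied from add_generator in model.py, and the
# length formula assumes explicit zero padding followed by a 'VALID' window,
# as done in ops.py.
# ----------------------------------------------------------------------
def conv_out_len(n, kernel, stride, pad=0):
    """Output length of a strided window over n samples padded by `pad` on each side."""
    padded = n + 2 * pad
    if padded < kernel:
        # The window does not fit at all; report zero steps for this case.
        return 0
    return (padded - kernel) // stride + 1


# (kernel, stride, pad) for conv1 to conv7 and the max-pool layers between them
SOUNDNET_STACK = [
    (64, 2, 32), (8, 8, 0),   # conv1, maxpool
    (32, 2, 16), (8, 8, 0),   # conv2, maxpool
    (16, 2, 8),               # conv3
    (8, 2, 4),                # conv4
    (4, 2, 2), (4, 4, 0),     # conv5, maxpool
    (4, 2, 2),                # conv6
    (4, 2, 2),                # conv7
]


def soundnet_out_len(num_samples, conv8_pad=0):
    """Estimated number of conv8 output steps for an input of num_samples samples."""
    n = num_samples
    for kernel, stride, pad in SOUNDNET_STACK:
        n = conv_out_len(n, kernel, stride, pad)
    # conv8 / conv8_2: kernel 8, stride 2 (padding 0 in model.py, 2 in main.py)
    return conv_out_len(n, 8, 2, conv8_pad)


# Under these assumptions, 20 s at 22050 Hz leaves 4 steps after conv8, while
# 4 s leaves none unless conv8 is padded by 2 as in main.py:
#   soundnet_out_len(22050 * 20)              # 4
#   soundnet_out_len(22050 * 4)               # 0
#   soundnet_out_len(22050 * 4, conv8_pad=2)  # 1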