[
  {
    "path": ".gitignore",
    "content": "# Data\ndata/*\n*.zip\noutput\nmodels/*\n\n# checkpoint\n*logs\n*checkpoint\n\n# trash\n.dropbox\n\n# Created by https://www.gitignore.io/api/python,vim\n\n### Python ###\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C extensions\n*.so\n\n# Distribution / packaging\n.Python\nenv/\nbuild/\ndevelop-eggs/\ndist/\ndownloads/\neggs/\n.eggs/\nlib/\nlib64/\nparts/\nsdist/\nvar/\n*.egg-info/\n.installed.cfg\n*.egg\n\n# PyInstaller\n#  Usually these files are written by a python script from a template\n#  before PyInstaller builds the exe, so as to inject date/other infos into it.\n*.manifest\n*.spec\n\n# Installer logs\npip-log.txt\npip-delete-this-directory.txt\n\n# Unit test / coverage reports\nhtmlcov/\n.tox/\n.coverage\n.coverage.*\n.cache\nnosetests.xml\ncoverage.xml\n*,cover\n.hypothesis/\n\n# Translations\n*.mo\n*.pot\n\n# Django stuff:\n*.log\n\n# Sphinx documentation\ndocs/_build/\n\n# PyBuilder\ntarget/\n\n\n### Vim ###\n[._]*.s[a-w][a-z]\n[._]s[a-w][a-z]\n*.un~\nSession.vim\n.netrwhist\n*~\n\n"
  },
  {
    "path": "LICENSE",
    "content": "MIT License\n\nCopyright (c) 2018 Hou-Ning Hu\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "README.md",
    "content": "# SoundNet-tensorflow\nTensorFlow implementation of \"SoundNet\" that learns rich natural sound representations.\n\nCode for paper \"[SoundNet: Learning Sound Representations from Unlabeled Video](https://arxiv.org/abs/1610.09001)\" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016\n\n![from soundnet](https://camo.githubusercontent.com/0b88af5c13ba987a17dcf90cd58816cf8ef04554/687474703a2f2f70726f6a656374732e637361696c2e6d69742e6564752f736f756e646e65742f736f756e646e65742e6a7067)\n\n# Prerequisites\n\n- Linux\n- NVIDIA GPU + CUDA 8.0 + CuDNNv5.1\n- Python 2.7 with numpy or Python 3.5\n- [Tensorflow](https://www.tensorflow.org/) 1.0.0 (up to 1.3.0)\n- librosa\n\n\n# Getting Started\n- Clone this repo:\n```bash\ngit clone git@github.com:eborboihuc/SoundNet-tensorflow.git\ncd SoundNet-tensorflow\n```\n\n- Pretrained Model\n\nI provide pre-trained models that are ported from [soundnet](http://data.csail.mit.edu/soundnet/soundnet_models_public.zip). You can download the 8 layer model [here](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjR015M1RLZW45OEU). 
Please place it as `./models/sound8.npy` in your folder.\n\n- Data\n\nPrepare your input mp3 files and place them under `./data/`\n\nGenerate an input txt file that lists them and place it under `./`\n```txt\n./data/0001.mp3\n./data/0002.mp3\n./data/0003.mp3\n...\n```\n\nFollow the steps in [extract features](#feature-extraction)\n\n\n- NOTE\n\nIf you find that [some audio with offset value `start` in FFMPEG causes a tremendous difference between `torch audio` and `librosa`](#FAQs), please **convert it** with the following command.\n```\nsox {input.mp3} {output.mp3} trim 0\n```\nAfter this, the result might be much better.\n\n# Demo\n\nTo run the demo, follow these steps:\n\ni) Download a converted npy file [demo.npy](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjcEtqQ3VIM1pvZ3c) and place it under `./data/`\n\nii) To extract multiple features from a pretrained model with a sound track loaded by torch `lua audio`:\nThe sound track is equivalent to the torch version.\n```bash\npython extract_feat.py -m {start layer number} -x {end layer number} -s\n```\n\nThen you can compare the outputs with the torch ones.\n\n# Feature Extraction \n\n## Minimum example\ni) Download input file [demo.mp3](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjTjVEWVI3dnBsTG8) and place it under `./data/`\n\nii) Prepare a file list in `txt` format (`demo.txt`) that includes the input mp3 file(s) and place it under `./`\n```txt\n./data/demo.mp3\n```\n\niii) Then extract features from the raw waves listed in `demo.txt`:\nPlease put the demo mp3 under ./data/[demo.mp3](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjTjVEWVI3dnBsTG8)\n```bash\npython extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt\n```\n\n## More options\n\nTo extract multiple features from a pretrained model with a downloaded mp3 dataset:\n```bash\npython extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract\n```\n\ne.g. 
extract layers 4 through 16 (the `-x` end layer is exclusive) and save them as `./sound_out/tf_fea%02d.npy`:\n```bash\npython extract_feat.py -o sound_out -m 4 -x 17 -s -p extract\n```\n\nMore details are in:\n```bash\npython extract_feat.py -h\n```\n\n\n# Finetuning\nTo train from an existing model:\n```bash\npython main.py \n```\n\n# Training\nTo train from scratch:\n```bash\npython main.py -p train\n```\n\nTo extract features:\n```bash\npython main.py -p extract -m {start layer number} -x {end layer number} -s\n```\n\nMore details are in:\n```bash\npython main.py -h\n```\n\n# TODOs\n\n- [x] Change audio loader to soundnet format\n- [x] Make it compatible with Python 3\n- [ ] Batch Norm behaviour different from Torch\n- [ ] Fix conv8 padding issue in training phase\n- [ ] Change all `config` into `tf.app.flags`  \n- [ ] Change dummy distribution of scene and object to useful placeholder\n- [ ] Add sound and feature loader from [Data](https://projects.csail.mit.edu/soundnet/) section\n\n# Known issues\n\n- Loaded audio length is not consistent between `torch7 audio` and `librosa`. Here is the [issue](https://github.com/soumith/lua---audio/issues/17#issuecomment-288648237)\n- Training with short audio clips will make conv8 complain that the [output size would be negative](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45)\n\n\n# FAQs\n\n- Why is my loaded sound wave different between `torch7 audio` and `librosa`? See my [Wiki](https://github.com/eborboihuc/SoundNet-tensorflow/wiki/info.md)\n\n# Acknowledgments\n\nCode ported from [soundnet](https://github.com/cvondrick/soundnet). The Torch7-to-TensorFlow loader is from [tf_videogan](https://github.com/Yuliang-Zou/tf_videogan). Thanks for their excellent work!\n\n\n## Author\n\nHou-Ning Hu / [@eborboihuc](https://eborboihuc.github.io/)\n\n"
  },
  {
    "path": "cmp.py",
    "content": "import numpy as np\nimport sys\n\nname = sys.argv[1]\ndec  = int(sys.argv[2]) if len(sys.argv) >= 3 else 4\n\nth = np.load('output/demo_th.npy', encoding='latin1').item()['layer{}'.format(name)].T\ntf = np.load('output/tf_fea{}.npy'.format(str(name).zfill(2)), encoding='latin1')\nif name == '25':\n    tf = np.concatenate([tf, np.load('output/tf_fea26.npy', encoding='latin1')], 1)\n\nprint('Layer {}: tf.shape={}, th.shape={}'.format(name, tf.shape, th.shape))\nprint('TF:')\nprint(tf)\nprint('Torch:')\nprint(th)\n\nsize = tf.shape[0] if tf.shape[0] < th.shape[0] else th.shape[0]\n\nprint('Round to {} decimals'.format(dec))\ntf = np.round(tf, decimals=dec)\nth = np.round(th, decimals=dec)\nprint('Total Diff: {} Max Diff: {} Min Diff: {}'.format(\n    np.sum(abs(tf[:size] - th[:size])), \\\n    np.max(tf[:size] - th[:size]), \\\n    np.min(tf[:size] - th[:size])))\n"
  },
  {
    "path": "demo.txt",
    "content": "data/demo.mp3\n"
  },
  {
    "path": "extract_feat.py",
    "content": "# TensorFlow version of NIPS2016 soundnet\n\nfrom util import load_from_txt\nfrom model import Model\nimport tensorflow as tf\nimport numpy as np\nimport argparse\nimport sys\nimport os\n\n# Make xrange compatible in both Python 2, 3\ntry:\n    xrange\nexcept NameError:\n    xrange = range\n\nlocal_config = {  \n            'batch_size': 1, \n            'eps': 1e-5,\n            'sample_rate': 22050,\n            'load_size': 22050*20,\n            'name_scope': 'SoundNet',\n            'phase': 'extract',\n            }\n\ndef parse_args():\n    \"\"\" Parse input arguments \"\"\"\n    parser = argparse.ArgumentParser(description='Extract Feature')\n    \n    parser.add_argument('-t', '--txt', dest='audio_txt', help='target audio txt path. e.g., [demo.txt]', default='demo.txt')\n\n    parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output')\n\n    parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [demo, extract]', default='demo')\n\n    parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1)\n\n    parser.add_argument('-x', dest='layer_max', help='end at which feature layer. e.g., [24]', type=int, default=None)\n    \n    parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0')\n\n    feature_parser = parser.add_mutually_exclusive_group(required=False)\n    feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. 
[False(default), True]', action='store_true')\n    parser.set_defaults(is_save=False)\n    \n    args = parser.parse_args()\n\n    return args\n\n\ndef extract_feat(model, sound_input, config):\n    layer_min = config.layer_min\n    layer_max = config.layer_max if config.layer_max is not None else layer_min + 1\n    \n    # Extract feature\n    features = {}\n    feed_dict = {model.sound_input_placeholder: sound_input}\n\n    for idx in xrange(layer_min, layer_max):\n        feature = model.sess.run(model.layers[idx], feed_dict=feed_dict)\n        features[idx] = feature\n        if config.is_save:\n            np.save(os.path.join(config.outpath, 'tf_fea{}.npy'.format( \\\n                str(idx).zfill(2))), np.squeeze(feature))\n            print(\"Save layer {} with shape {} as {}/tf_fea{}.npy\".format( \\\n                    idx, np.squeeze(feature).shape, config.outpath, str(idx).zfill(2)))\n    \n    return features\n\n\nif __name__ == '__main__':\n\n    args = parse_args()\n\n    # Setup visible device\n    os.environ[\"CUDA_VISIBLE_DEVICES\"] = args.cuda_device\n\n    # Load pre-trained model\n    G_name = './models/sound8.npy'\n    param_G = np.load(G_name, encoding = 'latin1').item()\n        \n    if args.phase == 'demo':\n        # Demo\n        sound_samples = [np.reshape(np.load('data/demo.npy', encoding='latin1'), [1, -1, 1, 1])]\n    else: \n        # Extract Feature\n        sound_samples = load_from_txt(args.audio_txt, config=local_config)\n    \n    # Make path\n    if not os.path.exists(args.outpath):\n        os.mkdir(args.outpath)\n\n    # Init. 
Session\n    sess_config = tf.ConfigProto()\n    sess_config.allow_soft_placement=True\n    sess_config.gpu_options.allow_growth = True\n    \n    with tf.Session(config=sess_config) as session:\n        # Build model\n        model = Model(session, config=local_config, param_G=param_G)\n        init = tf.global_variables_initializer()\n        session.run(init)\n        \n        model.load()\n    \n        for sound_sample in sound_samples:\n            output = extract_feat(model, sound_sample, args)\n"
  },
  {
    "path": "h5convert.py",
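    "content": "import numpy as np\nimport h5py\nimport sys\n\n\nth = h5py.File(sys.argv[1], 'r')\n# Materialize the keys as a list so indexing works on both Python 2 and 3\nkeys = list(th.keys())\nprint(keys)\n\n\nif len(keys) <= 1:\n    # Single dataset: save it directly as an array\n    key = keys[0]\n    npy = np.array(th[key])\nelse:\n    # Multiple datasets: save them as a dict keyed by dataset name\n    npy = {}\n    for key in keys:\n        npy[key] = np.array(th[key])\n\nnp.save(sys.argv[2], npy)\n\n\n"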
  },
  {
    "path": "load_t7.py",
    "content": "# Load t7 files\n# Required package: torchfile. \n# $ pip install torchfile\n\nimport torchfile\nimport numpy as np\nimport pdb\n\n# Make xrange compatible in both Python 2, 3\ntry:\n    xrange\nexcept NameError:\n    xrange = range\n\nkeys = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'conv6',\n        'conv7', 'conv8', 'conv8_2']\n\ndef load(o, param_list):\n    \"\"\" Get torch7 weights into numpy array \"\"\"\n    try:\n        num = len(o['modules'])\n    except:\n        num = 0\n    \n    for i in xrange(num):\n        # 2D conv\n        if o['modules'][i]._typename == 'nn.SpatialConvolution' or \\\n            o['modules'][i]._typename == 'cudnn.SpatialConvolution':\n            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),\n                    'biases': o['modules'][i]['bias']}\n            param_list.append(temp)\n        # 2D deconv\n        elif o['modules'][i]._typename == 'nn.SpatialFullConvolution':\n            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),\n                    'biases': o['modules'][i]['bias']}\n            param_list.append(temp)\n        # 3D conv\n        elif o['modules'][i]._typename == 'nn.VolumetricFullConvolution':\n            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,4,1,0)),\n                    'biases': o['modules'][i]['bias']}\n            param_list.append(temp)\n        # batch norm\n        elif o['modules'][i]._typename == 'nn.SpatialBatchNormalization' or \\\n            o['modules'][i]._typename == 'nn.VolumetricBatchNormalization':\n            param_list[-1]['gamma'] = o['modules'][i]['weight']\n            param_list[-1]['beta'] = o['modules'][i]['bias']\n            param_list[-1]['mean'] = o['modules'][i]['running_mean']\n            param_list[-1]['var'] = o['modules'][i]['running_var']\n\n        load(o['modules'][i], param_list)\n\n\ndef show(o):\n    \"\"\" Show nn information \"\"\"\n    nn = {}\n    nn_keys = {}\n    nn_info 
= {}\n    num = len(o['modules']) if o['modules'] else 0\n    mylist = get_mylist()\n\n    for i in xrange(num):\n        # Get _obj and keys from torchfile\n        nn[i] = o['modules'][i]._obj\n        nn_keys[i] = o['modules'][i]._obj.keys()\n        \n        # Get information from _obj\n        # {layer i: {mylist keys: value}}\n        nn_info[i] = {key: nn[i][key] for key in sorted(nn_keys[i]) if key in mylist}\n        nn_info[i]['name'] = o['modules'][i]._typename\n        print(i, nn_info[i]['name'])\n        for item in sorted(nn_info[i].keys()): \n            print(\"  {}:{}\".format(item, nn_info[i][item] if 'running' not in item \\\n                                                        else nn_info[i][item].shape))\n\n\ndef get_mylist():\n    \"\"\" Return manually selected information lists \"\"\"\n    return ['_type', 'nInputPlane', 'nOutputPlane', \\\n            'input_offset', 'groups', 'dH', 'dW', \\\n            'padH', 'padW', 'kH', 'kW', 'iSize', \\\n            'running_mean', 'running_var']\n\n\nif __name__ == '__main__':\n    # File loader\n    t7_file = './models/soundnet8_final.t7'\n    o = torchfile.load(t7_file)\n    \n    # To show nn parameter\n    show(o)\n    \n    # To store as npy file\n    param_list = []\n    load(o, param_list)\n    save_list = {}\n    for i, k in enumerate(keys):\n        save_list[k] = param_list[i]\n    np.save('sound8', save_list)\n\n"
  },
  {
    "path": "main.py",
    "content": "# TensorFlow version of NIPS2016 soundnet\n# Required package: librosa: A python package for music and audio analysis.\n# $ pip install librosa\n\nfrom ops import batch_norm, conv2d, relu, maxpool\nfrom util import preprocess, load_from_list, load_audio\nfrom model import Model\nfrom glob import glob\n\nimport tensorflow as tf\nimport numpy as np\nimport argparse\nimport time\nimport sys\nimport os\n\n\n# Make xrange compatible in both Python 2, 3\ntry:\n    xrange\nexcept NameError:\n    xrange = range\n\nlocal_config = {\n            'batch_size': 1, \n            'train_size': np.inf,\n            'epoch': 200,\n            'eps': 1e-5,\n            'learning_rate': 1e-3,\n            'beta1': 0.9,\n            'load_size': 22050*4,\n            'sample_rate': 22050,\n            'name_scope': 'SoundNet',\n            'phase': 'train',\n            'dataset_name': 'ESC50',\n            'subname': 'mp3',\n            'checkpoint_dir': 'checkpoint',\n            'dump_dir': 'output',\n            'model_dir': None,\n            'param_g_dir': './models/sound8.npy',\n            }\n\n\nclass Model():\n    def __init__(self, session, config=local_config, param_G=None):\n        self.sess           = session\n        self.config         = config\n        self.param_G        = param_G\n        self.g_step         = tf.Variable(0, trainable=False)\n        self.counter        = 0\n        self.model()\n \n\n    def model(self):\n        # Placeholder\n        self.sound_input_placeholder = tf.placeholder(tf.float32,\n                shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel\n        self.object_dist = tf.placeholder(tf.float32,\n                shape=[self.config['batch_size'], None, 1000]) # batch x h x w x channel\n        self.scene_dist = tf.placeholder(tf.float32,\n                shape=[self.config['batch_size'], None, 401]) # batch x h x w x channel\n        \n        # Generator\n        
self.add_generator(name_scope=self.config['name_scope'])\n \n        # KL Divergence\n        self.object_loss = self.KL_divergence(self.layers[25], self.object_dist, name_scope='KL_Div_object')\n        self.scene_loss = self.KL_divergence(self.layers[26], self.scene_dist, name_scope='KL_Div_scene')\n        self.loss = self.object_loss + self.scene_loss\n\n        # Summary\n        self.loss_sum = tf.summary.scalar(\"g_loss\", self.loss)\n        self.g_sum = tf.summary.merge([self.loss_sum])\n        self.writer = tf.summary.FileWriter(\"./logs\", self.sess.graph)\n        \n        # variable collection\n        self.g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, \n                                    scope=self.config['name_scope'])\n\n        self.saver = tf.train.Saver(keep_checkpoint_every_n_hours=12, \n                                    max_to_keep=5, \n                                    restore_sequentially=True)\n\n        # Optimizer and summary\n        self.g_optim = tf.train.AdamOptimizer(self.config['learning_rate'], beta1=self.config['beta1']) \\\n                          .minimize(self.loss, var_list=(self.g_vars), global_step=self.g_step)\n        \n        # Initialize\n        init_op = tf.global_variables_initializer()\n        self.sess.run(init_op)\n        \n        # Load checkpoint\n        if self.load(self.config['checkpoint_dir']):\n            print(\" [*] Load SUCCESS\")\n        else:\n            print(\" [!] 
Load failed...\")\n\n\n    def add_generator(self, name_scope='SoundNet'):\n        with tf.variable_scope(name_scope) as scope:\n            self.layers = {}\n            \n            # Stream one: conv1 ~ conv7\n            self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1')\n            self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1')\n            self.layers[3] = relu(self.layers[2], name_scope='conv1')\n            self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1')\n\n            self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2')\n            self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2')\n            self.layers[7] = relu(self.layers[6], name_scope='conv2')\n            self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2')\n\n            self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3')\n            self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3')\n            self.layers[11] = relu(self.layers[10], name_scope='conv3')\n\n            self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4')\n            self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4')\n            self.layers[14] = relu(self.layers[13], name_scope='conv4')\n\n            self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5')\n            self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5')\n            self.layers[17] = relu(self.layers[16], name_scope='conv5')\n            self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5')\n\n            self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, 
name_scope='conv6')\n            self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6')\n            self.layers[21] = relu(self.layers[20], name_scope='conv6')\n\n            self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7')\n            self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7')\n            self.layers[24] = relu(self.layers[23], name_scope='conv7')\n\n            # Split one: conv8, conv8_2\n            # NOTE: here we use a padding of 2 to skip an unknown error\n            # https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45\n            self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, p_h=2, name_scope='conv8')\n            self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, p_h=2, name_scope='conv8_2')\n\n\n    def train(self):\n        \"\"\"Train SoundNet\"\"\"\n\n        start_time = time.time()\n\n        # Data info\n        data = glob('./data/*.{}'.format(self.config['subname']))\n        batch_idxs = min(len(data), self.config['train_size']) // self.config['batch_size']\n        for epoch in xrange(self.counter//batch_idxs, self.config['epoch']):\n\n            for idx in xrange(self.counter%batch_idxs, batch_idxs):\n        \n                # By default, librosa will resample the signal to 22050Hz. 
And range in (-1., 1.)\n                sound_sample = load_from_list(data[idx*self.config['batch_size']:(idx+1)*self.config['batch_size']], self.config)\n                \n                # Update G network\n                # NOTE: Here we still use dummy random distribution for scene and objects\n                _, summary_str, l_scn, l_obj = self.sess.run([self.g_optim, self.g_sum, self.scene_loss, self.object_loss],\n                    feed_dict={self.sound_input_placeholder: sound_sample, \\\n                            self.scene_dist: np.random.randint(2, size=(1, 1, 401)), \\\n                            self.object_dist: np.random.randint(2, size=(1, 1, 1000))})\n                self.writer.add_summary(summary_str, self.counter)\n\n                print (\"[Epoch {}] {}/{} | Time: {} | scene_loss: {} | obj_loss: {}\".format(epoch, idx, batch_idxs, time.time() - start_time, l_scn, l_obj))\n\n                if np.mod(self.counter, 1000) == 1000 - 1:\n                    self.save(self.config['checkpoint_dir'], self.counter)\n\n                self.counter += 1\n\n\n    #########################\n    #          Loss         #\n    #########################\n    # Adapt the answer here: http://stackoverflow.com/questions/41863814/kl-divergence-in-tensorflow\n    def KL_divergence(self, dist_a, dist_b, name_scope='KL_Div'):\n        return tf.reduce_mean(-tf.nn.softmax_cross_entropy_with_logits(logits=dist_a, labels=dist_b))\n\n\n    #########################\n    #       Save/Load       #\n    #########################\n    @property\n    def get_model_dir(self):\n        if self.config['model_dir'] is None:\n            return \"{}_{}\".format(\n                self.config['dataset_name'], self.config['batch_size'])\n        else:\n            return self.config['model_dir']\n    \n\n    def load(self, ckpt_dir='checkpoint'):\n        return self.load_from_ckpt(ckpt_dir) if self.param_G is None \\\n        else self.load_from_npy()\n\n\n    def save(self, 
checkpoint_dir, step):\n        \"\"\" Checkpoint saver \"\"\"\n        model_name = \"SoundNet.model\"\n        checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir)\n\n        if not os.path.exists(checkpoint_dir):\n            os.makedirs(checkpoint_dir)\n\n        self.saver.save(self.sess,\n                        os.path.join(checkpoint_dir, model_name),\n                        global_step=step)\n\n\n    def load_from_ckpt(self, checkpoint_dir='checkpoint'):\n        \"\"\" Checkpoint loader \"\"\"\n        print(\" [*] Reading checkpoints...\")\n\n        checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir)\n\n        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)\n        if ckpt and ckpt.model_checkpoint_path:\n            ckpt_name = os.path.basename(ckpt.model_checkpoint_path)\n            self.saver.restore(self.sess, os.path.join(checkpoint_dir, ckpt_name))\n            print(\" [*] Success to read {}\".format(ckpt_name))\n            self.counter = int(ckpt_name.rsplit('-', 1)[-1])\n            print(\" [*] Start counter from {}\".format(self.counter))\n            return True\n        else:\n            print(\" [*] Failed to find a checkpoint under {}\".format(checkpoint_dir))\n            return False\n\n\n    def load_from_npy(self):\n        if self.param_G is None: return False\n        data_dict = self.param_G\n        for key in data_dict:\n            with tf.variable_scope(self.config['name_scope'] + '/'+ key, reuse=True):\n                for subkey in data_dict[key]:\n                    try:\n                        var = tf.get_variable(subkey)\n                        self.sess.run(var.assign(data_dict[key][subkey]))\n                        print('Assign pretrain model {} to {}'.format(subkey, key))\n                    except:\n                        print('Ignore {}'.format(key))\n        \n        self.param_G.clear()\n        return True\n\n\ndef main():\n\n    args = parse_args()\n    
local_config['phase'] = args.phase\n\n    # Setup visible device\n    os.environ[\"CUDA_VISIBLE_DEVICES\"] = args.cuda_device\n\n    # Make path\n    if not os.path.exists(args.outpath):\n        os.mkdir(args.outpath)\n    \n    # Load pre-trained model\n    param_G = np.load(local_config['param_g_dir'], encoding='latin1').item() \\\n            if args.phase in ['finetune', 'extract'] \\\n            else None\n\n    # Init. Session\n    sess_config = tf.ConfigProto()\n    sess_config.allow_soft_placement=True\n    sess_config.gpu_options.allow_growth = True\n    \n    with tf.Session(config=sess_config) as session:\n        # Build model\n        model = Model(session, config=local_config, param_G=param_G)\n \n        if args.phase in ['train', 'finetune']:\n            # Training phase\n            model.train()\n        elif args.phase == 'extract':\n            # import when we need\n            from extract_feat import extract_feat\n\n            # Feature extractor\n            #sound_sample = np.reshape(np.load('./data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1])\n            \n            import librosa\n            audio_path = './data/demo.mp3'\n            sound_sample, _ = load_audio(audio_path)\n            sound_sample = preprocess(sound_sample, config=local_config)\n\n            output = extract_feat(model, sound_sample, args)\n\n\ndef parse_args():\n    \"\"\" Parse input arguments \"\"\"\n    parser = argparse.ArgumentParser(description='SoundNet')\n    \n    parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output')\n\n    parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [train, finetune, extract]', default='finetune')\n\n    parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. 
e.g., [1]', type=int, default=1)\n\n    parser.add_argument('-x', dest='layer_max', help='end at which feature layer. e.g., [24]', type=int, default=None)\n    \n    parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0')\n\n    feature_parser = parser.add_mutually_exclusive_group(required=False)\n    feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. [False(default), True]', action='store_true')\n    parser.set_defaults(is_save=False)\n    \n    args = parser.parse_args()\n\n    return args\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "model.py",
    "content": "# TensorFlow version of NIPS2016 soundnet\n\nimport sys\nimport numpy as np\nimport tensorflow as tf\nfrom ops import batch_norm, conv2d, relu, maxpool\n\n# Make xrange compatible in both Python 2, 3\ntry:\n    xrange\nexcept NameError:\n    xrange = range\n\nlocal_config = {  \n            'batch_size': 1, \n            'eps': 1e-5,\n            'name_scope': 'SoundNet',\n            }\n\nclass Model():\n    def __init__(self, session, config=local_config, param_G=None):\n        # Print config\n        for key in config: print(\"{}:{}\".format(key, config[key]))\n\n        self.sess           = session\n        self.config         = config\n        self.param_G        = param_G\n        \n        # Placeholder\n        self.add_placeholders()\n        \n        # Generator\n        self.add_generator(name_scope=self.config['name_scope'])\n\n\n    def add_placeholders(self):\n        self.sound_input_placeholder = tf.placeholder(tf.float32,\n                shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel\n\n\n    def add_generator(self, name_scope='SoundNet'):\n        with tf.variable_scope(name_scope) as scope:\n            self.layers = {}\n            \n            # Stream one: conv1 ~ conv7\n            self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1')\n            self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1')\n            self.layers[3] = relu(self.layers[2], name_scope='conv1')\n            self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1')\n\n            self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2')\n            self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2')\n            self.layers[7] = relu(self.layers[6], name_scope='conv2')\n            self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, 
name_scope='conv2')\n\n            self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3')\n            self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3')\n            self.layers[11] = relu(self.layers[10], name_scope='conv3')\n\n            self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4')\n            self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4')\n            self.layers[14] = relu(self.layers[13], name_scope='conv4')\n\n            self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5')\n            self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5')\n            self.layers[17] = relu(self.layers[16], name_scope='conv5')\n            self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5')\n\n            self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6')\n            self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6')\n            self.layers[21] = relu(self.layers[20], name_scope='conv6')\n\n            self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7')\n            self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7')\n            self.layers[24] = relu(self.layers[23], name_scope='conv7')\n\n            # Split one: conv8, conv8_2\n            self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, name_scope='conv8')\n            self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, name_scope='conv8_2')\n\n\n    def load(self):\n        if self.param_G is None: return False\n        data_dict = self.param_G\n        for key in data_dict:\n            with tf.variable_scope(self.config['name_scope'] + '/' + key, 
reuse=True):\n                for subkey in data_dict[key]:\n                    try:\n                        var = tf.get_variable(subkey)\n                        self.sess.run(var.assign(data_dict[key][subkey]))\n                        print('Assign pretrain model {} to {}'.format(subkey, key))\n                    except ValueError:\n                        # Variable is missing from the graph or its shape does not match\n                        print('Ignore {}/{}'.format(key, subkey))\n        self.param_G.clear()\n        return True\n\n\nif __name__ == '__main__':\n    \n    layer_min = int(sys.argv[1])\n    layer_max = int(sys.argv[2]) if len(sys.argv) > 2 else layer_min + 1\n    \n    # Load pre-trained model\n    # allow_pickle=True is required by NumPy >= 1.16.3 to load the pickled parameter dict\n    G_name = './models/sound8.npy'\n    param_G = np.load(G_name, allow_pickle=True, encoding='latin1').item()\n    dump_path = './output/'\n\n    with tf.Session() as session:\n        # Build model\n        model = Model(session, config=local_config, param_G=param_G)\n        init = tf.global_variables_initializer()\n        session.run(init)\n        \n        model.load()\n        \n        # Demo\n        sound_input = np.reshape(np.load('data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1])\n        feed_dict = {model.sound_input_placeholder: sound_input}\n        \n        # Forward\n        for idx in xrange(layer_min, layer_max):\n            feature = session.run(model.layers[idx], feed_dict=feed_dict)\n            np.save(dump_path + 'tf_fea{}.npy'.format(str(idx).zfill(2)), np.squeeze(feature))\n            print(\"Save layer {} with shape {} as {}tf_fea{}.npy\".format(idx, np.squeeze(feature).shape, dump_path, str(idx).zfill(2)))\n\n"
  },
  {
    "path": "ops.py",
    "content": "# TensorFlow version of NIPS2016 soundnet\nimport tensorflow as tf\n\ndef conv2d(prev_layer, in_ch, out_ch, k_h=1, k_w=1, d_h=1, d_w=1, p_h=0, p_w=0, pad='VALID', name_scope='conv'):\n    with tf.variable_scope(name_scope) as scope:\n        # h x w x input_channel x output_channel\n        w_conv = tf.get_variable('weights', [k_h, k_w, in_ch, out_ch], \n                initializer=tf.truncated_normal_initializer(0.0, stddev=0.01))\n        b_conv = tf.get_variable('biases', [out_ch], \n                initializer=tf.constant_initializer(0.0))\n        \n        padded_input = tf.pad(prev_layer, [[0, 0], [p_h, p_h], [p_w, p_w], [0, 0]], \"CONSTANT\") if pad == 'VALID' \\\n                else prev_layer\n\n        output = tf.nn.conv2d(padded_input, w_conv, \n                [1, d_h, d_w, 1], padding=pad, name='z') + b_conv\n    \n        return output\n\n\ndef batch_norm(prev_layer, out_ch, eps, name_scope='conv'):\n    with tf.variable_scope(name_scope) as scope:\n        #mu_conv, var_conv = tf.nn.moments(prev_layer, [0, 1, 2], keep_dims=False)\n        mu_conv = tf.get_variable('mean', [out_ch], \n            initializer=tf.constant_initializer(0))\n        var_conv = tf.get_variable('var', [out_ch], \n            initializer=tf.constant_initializer(1))\n        gamma_conv = tf.get_variable('gamma', [out_ch], \n            initializer=tf.constant_initializer(1))\n        beta_conv = tf.get_variable('beta', [out_ch], \n            initializer=tf.constant_initializer(0))\n        output = tf.nn.batch_normalization(prev_layer, mu_conv, \n            var_conv, beta_conv, gamma_conv, eps, name='batch_norm')\n        \n        return output\n\n\ndef relu(prev_layer, name_scope='conv'):\n    with tf.variable_scope(name_scope) as scope:\n        return tf.nn.relu(prev_layer, name='a')\n\n\ndef maxpool(prev_layer, k_h=1, k_w=1, d_h=1, d_w=1, name_scope='conv'):\n    with tf.variable_scope(name_scope) as scope:\n        return tf.nn.max_pool(prev_layer, 
\n                [1, k_h, k_w, 1], [1, d_h, d_w, 1], padding='VALID', name='maxpool')\n"
  },
  {
    "path": "util.py",
    "content": "import numpy as np\nimport librosa\n\nlocal_config = {\n            'batch_size': 64, \n            'load_size': 22050*20,\n            'phase': 'extract'\n            }\n\n\ndef load_from_list(name_list, config=local_config):\n    assert len(name_list) == config['batch_size'], \\\n            \"The length of name_list({})[{}] is not the same as batch_size[{}]\".format(\n                    name_list[0], len(name_list), config['batch_size'])\n    audios = np.zeros([config['batch_size'], config['load_size'], 1, 1])\n    for idx, audio_path in enumerate(name_list):\n        sound_sample, _ = load_audio(audio_path)\n        audios[idx] = preprocess(sound_sample, config)\n        \n    return audios\n\n\ndef load_from_txt(txt_name, config=local_config):\n    with open(txt_name, 'r') as handle:\n        txt_list = handle.read().splitlines()\n\n    audios = []\n    for idx, audio_path in enumerate(txt_list):\n        sound_sample, _ = load_audio(audio_path)\n        audios.append(preprocess(sound_sample, config))\n        \n    return audios\n\n\n# NOTE: Load audio in the same format as SoundNet\n# 1. Keep the original sample rate (which conflicts with their own paper)\n# 2. Use the first channel of multi-channel audio\n# 3. Keep range in [-256, 256]\n\ndef load_audio(audio_path, sr=None):\n    # librosa resamples to 22050 Hz by default; sr=None keeps the native sample rate. 
The range is (-1., 1.)\n    sound_sample, sr = librosa.load(audio_path, sr=sr, mono=False)\n\n    return sound_sample, sr\n\n\ndef preprocess(raw_audio, config=local_config):\n    # Select first channel (mono)\n    if len(raw_audio.shape) > 1:\n        raw_audio = raw_audio[0]\n\n    # Make range [-256, 256]\n    raw_audio *= 256.0\n\n    # Tile short clips until they cover load_size (np.tile needs an integer repeat count)\n    length = config['load_size']\n    if length > raw_audio.shape[0]:\n        raw_audio = np.tile(raw_audio, length // raw_audio.shape[0] + 1)\n\n    # Make equal training length\n    if config['phase'] != 'extract':\n        raw_audio = raw_audio[:length]\n\n    # Check conditions\n    assert len(raw_audio.shape) == 1, \"It seems this audio contains two channels, we only need the first channel\"\n    assert np.max(raw_audio) <= 256, \"It seems this audio contains signal that exceeds 256\"\n    assert np.min(raw_audio) >= -256, \"It seems this audio contains signal that falls below -256\"\n\n    # Shape to 1 x DIM x 1 x 1\n    raw_audio = np.reshape(raw_audio, [1, -1, 1, 1])\n\n    return raw_audio.copy()\n\n\n"
  }
]