Repository: eborboihuc/SoundNet-tensorflow
Branch: master
Commit: b603cd4584a9
Files: 12
Total size: 36.7 KB

Directory structure:
gitextract_78ltv8rb/

├── .gitignore
├── LICENSE
├── README.md
├── cmp.py
├── demo.txt
├── extract_feat.py
├── h5convert.py
├── load_t7.py
├── main.py
├── model.py
├── ops.py
└── util.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Data
data/*
*.zip
output
models/*

# checkpoint
*logs
*checkpoint

# trash
.dropbox

# Created by https://www.gitignore.io/api/python,vim

### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log

# Sphinx documentation
docs/_build/

# PyBuilder
target/


### Vim ###
[._]*.s[a-w][a-z]
[._]s[a-w][a-z]
*.un~
Session.vim
.netrwhist
*~


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2018 Hou-Ning Hu

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: README.md
================================================
# SoundNet-tensorflow
TensorFlow implementation of "SoundNet" that learns rich natural sound representations.

Code for paper "[SoundNet: Learning Sound Representations from Unlabeled Video](https://arxiv.org/abs/1610.09001)" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016

![from soundnet](https://camo.githubusercontent.com/0b88af5c13ba987a17dcf90cd58816cf8ef04554/687474703a2f2f70726f6a656374732e637361696c2e6d69742e6564752f736f756e646e65742f736f756e646e65742e6a7067)

# Prerequisites

- Linux
- NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
- Python 2.7 with numpy or Python 3.5
- [Tensorflow](https://www.tensorflow.org/) 1.0.0 (up to 1.3.0)
- librosa


# Getting Started
- Clone this repo:
```bash
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
```

- Pretrained Model

I provide pre-trained models that are ported from [soundnet](http://data.csail.mit.edu/soundnet/soundnet_models_public.zip). You can download the 8 layer model [here](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjR015M1RLZW45OEU). Please place it as `./models/sound8.npy` in your folder.

- Data

Prepare your input mp3 files and place them under `./data/`

Generate an input txt file listing them and place it under `./`
```txt
./data/0001.mp3
./data/0002.mp3
./data/0003.mp3
...
```
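The list can also be generated with a few lines of Python. This is a minimal sketch and not part of the repository: `write_file_list` is a hypothetical helper, and it assumes all inputs are `.mp3` files sitting directly under `./data/`.

```python
import glob
import os


def write_file_list(data_dir, txt_path, ext="mp3"):
    """Write one audio path per line (sorted by name) and return the paths."""
    paths = sorted(glob.glob(os.path.join(data_dir, "*." + ext)))
    with open(txt_path, "w") as f:
        for path in paths:
            f.write(path + "\n")
    return paths


# Example call (hypothetical file name): write_file_list("./data", "./my_list.txt")
```

The resulting txt file can then be passed to `extract_feat.py` with `-t`.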

Follow the steps in [extract features](#feature-extraction)


- NOTE

If you find that [audio with a non-zero `start` offset reported by FFMPEG loads very differently in `torch audio` and `librosa`](#FAQs), please **convert it** with the following command.
```
sox {input.mp3} {output.mp3} trim 0
```
After this, the result might be much better.
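
To apply the same conversion to many files, the command can be wrapped in Python. This is only a sketch: `sox_trim_command` and `convert` are hypothetical helpers, and `convert` assumes `sox` is installed and on your `PATH`.

```python
import subprocess


def sox_trim_command(src, dst):
    """Build the `sox {input.mp3} {output.mp3} trim 0` command shown above."""
    return ["sox", src, dst, "trim", "0"]


def convert(src, dst):
    """Run sox on one file; raises CalledProcessError if sox exits with an error."""
    subprocess.check_call(sox_trim_command(src, dst))
```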

# Demo

For the demo, follow these steps:

i) Download a converted npy file [demo.npy](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjcEtqQ3VIM1pvZ3c) and place it under `./data/`

ii) Extract multiple features from a pretrained model, using a sound track loaded with torch `lua audio` (so the input is identical to the torch version):
```bash
python extract_feat.py -m {start layer number} -x {end layer number} -s
```

Then you can compare the outputs with torch ones.
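
The comparison itself boils down to rounding both arrays and looking at their difference, which is what `cmp.py` in this repo does. Below is a minimal sketch of that idea. `compare_features` is a hypothetical helper; it assumes both features are already loaded as NumPy arrays with time along the first axis.

```python
import numpy as np


def compare_features(tf_feat, th_feat, decimals=4):
    """Compare two feature arrays over the rows they have in common."""
    rows = min(tf_feat.shape[0], th_feat.shape[0])
    diff = np.round(tf_feat[:rows], decimals) - np.round(th_feat[:rows], decimals)
    return {
        "total_abs": float(np.sum(np.abs(diff))),  # summed absolute difference
        "max": float(np.max(diff)),                # largest signed difference
        "min": float(np.min(diff)),                # smallest signed difference
    }
```

For the actual script, run `python cmp.py {layer number}`; it expects the torch features in `output/demo_th.npy`.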

# Feature Extraction 

## Minimum example
i) Download input file [demo.mp3](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjTjVEWVI3dnBsTG8) and place it under `./data/`

ii) Prepare a file list in `txt` format (`demo.txt`) that includes the input mp3 file(s) and place it under `./`
```txt
./data/demo.mp3
```

iii) Then extract features from the raw waveform of each file listed in `demo.txt`:
```bash
python extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt
```

## More options

To extract multiple features from a pretrained model with downloaded mp3 dataset:
```bash
python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract
```

e.g. extract layers 4 to 16 (the `-x` end layer is exclusive, so `-x 17` stops at layer 16) and save them as `./sound_out/tf_fea%02d.npy`:
```bash
python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract
```
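
To see which files a given `-m`/`-x` pair produces, the naming logic of `extract_feat.py` can be sketched as follows. `output_paths` is a hypothetical helper written for illustration; it mirrors the script's loop over `range(layer_min, layer_max)` and its zero-padded file names.

```python
import os


def output_paths(outpath, layer_min, layer_max=None):
    """List the feature files written for layers layer_min .. layer_max - 1."""
    if layer_max is None:
        # Same default as extract_feat.py: only the start layer is extracted
        layer_max = layer_min + 1
    return [
        os.path.join(outpath, "tf_fea{}.npy".format(str(idx).zfill(2)))
        for idx in range(layer_min, layer_max)
    ]
```

For example, `output_paths("sound_out", 4, 17)` lists 13 files, from `tf_fea04.npy` to `tf_fea16.npy`.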

More details are in:
```bash
python extract_feat.py -h
```


# Finetuning
To train from an existing model:
```bash
python main.py 
```

# Training
To train from scratch:
```bash
python main.py -p train
```

To extract features:
```bash
python main.py -p extract -m {start layer number} -x {end layer number} -s
```

More details are in:
```bash
python main.py -h
```

# TODOs

- [x] Change audio loader to soundnet format
- [x] Make it compatible with Python 3
- [ ] Batch Norm behaviour different from Torch
- [ ] Fix conv8 padding issue in training phase
- [ ] Change all `config` into `tf.app.flags`  
- [ ] Change dummy distribution of scene and object to useful placeholder
- [ ] Add sound and feature loader from [Data](https://projects.csail.mit.edu/soundnet/) section

# Known issues

- Loaded audio length is not consistent between `torch7 audio` and `librosa`. Here is the [issue](https://github.com/soumith/lua---audio/issues/17#issuecomment-288648237)
- Training with short audio clips makes conv8 complain that the [output size would be negative](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45)
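
The conv8 issue can be checked with a quick length calculation. The sketch below is not repository code. It takes the kernel, stride, and padding values along the time axis from `model.py`, and it assumes that each layer produces `floor((n + 2p - k) / s) + 1` outputs (explicit zero padding followed by a `VALID` convolution or pooling) and that `conv2d` pads with 0 when `p_h` is not given.

```python
# (kind, kernel, stride, padding) along the time axis, as listed in model.py
LAYERS = [
    ("conv", 64, 2, 32), ("pool", 8, 8, 0),  # conv1 + maxpool
    ("conv", 32, 2, 16), ("pool", 8, 8, 0),  # conv2 + maxpool
    ("conv", 16, 2, 8),                      # conv3
    ("conv", 8, 2, 4),                       # conv4
    ("conv", 4, 2, 2), ("pool", 4, 4, 0),    # conv5 + maxpool
    ("conv", 4, 2, 2),                       # conv6
    ("conv", 4, 2, 2),                       # conv7
]


def conv8_output_length(n_samples, conv8_pad=0):
    """Return the number of conv8 time steps; a value <= 0 means the input is too short."""
    n = n_samples
    # Pooling uses the same formula as convolution, with zero padding
    for _kind, k, s, p in LAYERS + [("conv", 8, 2, conv8_pad)]:
        n = (n + 2 * p - k) // s + 1
        if n <= 0:
            return n
    return n
```

Under these assumptions, a 20 second clip at 22050 Hz (`load_size` in `extract_feat.py`) yields 4 conv8 steps, while a 4 second clip (`load_size` in `main.py`) yields a non-positive length without padding and 1 step with the padding of 2 used in `main.py`.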


# FAQs

- Why does my loaded sound wave differ between `torch7 audio` and `librosa`? See my [Wiki](https://github.com/eborboihuc/SoundNet-tensorflow/wiki/info.md)

# Acknowledgments

Code ported from [soundnet](https://github.com/cvondrick/soundnet). The Torch7-to-TensorFlow loader is from [tf_videogan](https://github.com/Yuliang-Zou/tf_videogan). Thanks for their excellent work!


## Author

Hou-Ning Hu / [@eborboihuc](https://eborboihuc.github.io/)


================================================
FILE: cmp.py
================================================
import numpy as np
import sys

name = sys.argv[1]
dec  = int(sys.argv[2]) if len(sys.argv) >= 3 else 4

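# The torch features are stored as a pickled dict, so newer NumPy needs allow_pickle=True
th = np.load('output/demo_th.npy', encoding='latin1', allow_pickle=True).item()['layer{}'.format(name)].T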
tf = np.load('output/tf_fea{}.npy'.format(str(name).zfill(2)), encoding='latin1')
if name == '25':
    tf = np.concatenate([tf, np.load('output/tf_fea26.npy', encoding='latin1')], 1)

print('Layer {}: tf.shape={}, th.shape={}'.format(name, tf.shape, th.shape))
print('TF:')
print(tf)
print('Torch:')
print(th)

size = tf.shape[0] if tf.shape[0] < th.shape[0] else th.shape[0]

print('Round to {} decimals'.format(dec))
tf = np.round(tf, decimals=dec)
th = np.round(th, decimals=dec)
print('Total Diff: {} Max Diff: {} Min Diff: {}'.format(
    np.sum(abs(tf[:size] - th[:size])), \
    np.max(tf[:size] - th[:size]), \
    np.min(tf[:size] - th[:size])))


================================================
FILE: demo.txt
================================================
data/demo.mp3


================================================
FILE: extract_feat.py
================================================
# TensorFlow version of NIPS2016 soundnet

from util import load_from_txt
from model import Model
import tensorflow as tf
import numpy as np
import argparse
import sys
import os

# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

local_config = {  
            'batch_size': 1, 
            'eps': 1e-5,
            'sample_rate': 22050,
            'load_size': 22050*20,
            'name_scope': 'SoundNet',
            'phase': 'extract',
            }

def parse_args():
    """ Parse input arguments """
    parser = argparse.ArgumentParser(description='Extract Feature')
    
    parser.add_argument('-t', '--txt', dest='audio_txt', help='target audio txt path. e.g., [demo.txt]', default='demo.txt')

    parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output')

    parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [demo, extract]', default='demo')

    parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1)

    parser.add_argument('-x', dest='layer_max', help='end before which feature layer (exclusive). e.g., [24]', type=int, default=None)
    
    parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0')

    feature_parser = parser.add_mutually_exclusive_group(required=False)
    feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. [False(default), True]', action='store_true')
    parser.set_defaults(is_save=False)
    
    args = parser.parse_args()

    return args


def extract_feat(model, sound_input, config):
    layer_min = config.layer_min
    layer_max = config.layer_max if config.layer_max is not None else layer_min + 1
    
    # Extract feature
    features = {}
    feed_dict = {model.sound_input_placeholder: sound_input}

    for idx in xrange(layer_min, layer_max):
        feature = model.sess.run(model.layers[idx], feed_dict=feed_dict)
        features[idx] = feature
        if config.is_save:
            np.save(os.path.join(config.outpath, 'tf_fea{}.npy'.format( \
                str(idx).zfill(2))), np.squeeze(feature))
            print("Save layer {} with shape {} as {}/tf_fea{}.npy".format( \
                    idx, np.squeeze(feature).shape, config.outpath, str(idx).zfill(2)))
    
    return features


if __name__ == '__main__':

    args = parse_args()

    # Setup visible device
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device

    # Load pre-trained model
    G_name = './models/sound8.npy'
    # The weights are a pickled dict, so newer NumPy needs allow_pickle=True
    param_G = np.load(G_name, encoding='latin1', allow_pickle=True).item()
        
    if args.phase == 'demo':
        # Demo
        sound_samples = [np.reshape(np.load('data/demo.npy', encoding='latin1'), [1, -1, 1, 1])]
    else: 
        # Extract Feature
        sound_samples = load_from_txt(args.audio_txt, config=local_config)
    
    # Make path
    if not os.path.exists(args.outpath):
        os.mkdir(args.outpath)

    # Init. Session
    sess_config = tf.ConfigProto()
    sess_config.allow_soft_placement=True
    sess_config.gpu_options.allow_growth = True
    
    with tf.Session(config=sess_config) as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)
        init = tf.global_variables_initializer()
        session.run(init)
        
        model.load()
    
        for sound_sample in sound_samples:
            output = extract_feat(model, sound_sample, args)


================================================
FILE: h5convert.py
================================================
import numpy as np
import h5py
import sys


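# Usage: python h5convert.py {input.h5} {output.npy}
th = h5py.File(sys.argv[1], 'r')
keys = list(th.keys())  # list() so that indexing also works in Python 3
print(keys)


if len(keys) <= 1:
    # Single dataset: save it as a plain array
    key = keys[0]
    npy = np.array(th[key])
else:
    # Several datasets: save a dict of arrays
    npy = {}
    for key in keys:
        npy[key] = np.array(th[key])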

np.save(sys.argv[2], npy)


================================================
FILE: load_t7.py
================================================
# Load t7 files
# Required package: torchfile. 
# $ pip install torchfile

import torchfile
import numpy as np
import pdb

# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

keys = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'conv6',
        'conv7', 'conv8', 'conv8_2']

def load(o, param_list):
    """ Get torch7 weights into numpy array """
    try:
        num = len(o['modules'])
    except:
        num = 0
    
    for i in xrange(num):
        # 2D conv
        if o['modules'][i]._typename == 'nn.SpatialConvolution' or \
            o['modules'][i]._typename == 'cudnn.SpatialConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # 2D deconv
        elif o['modules'][i]._typename == 'nn.SpatialFullConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # 3D conv
        elif o['modules'][i]._typename == 'nn.VolumetricFullConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,4,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # batch norm
        elif o['modules'][i]._typename == 'nn.SpatialBatchNormalization' or \
            o['modules'][i]._typename == 'nn.VolumetricBatchNormalization':
            param_list[-1]['gamma'] = o['modules'][i]['weight']
            param_list[-1]['beta'] = o['modules'][i]['bias']
            param_list[-1]['mean'] = o['modules'][i]['running_mean']
            param_list[-1]['var'] = o['modules'][i]['running_var']

        load(o['modules'][i], param_list)


def show(o):
    """ Show nn information """
    nn = {}
    nn_keys = {}
    nn_info = {}
    num = len(o['modules']) if o['modules'] else 0
    mylist = get_mylist()

    for i in xrange(num):
        # Get _obj and keys from torchfile
        nn[i] = o['modules'][i]._obj
        nn_keys[i] = o['modules'][i]._obj.keys()
        
        # Get information from _obj
        # {layer i: {mylist keys: value}}
        nn_info[i] = {key: nn[i][key] for key in sorted(nn_keys[i]) if key in mylist}
        nn_info[i]['name'] = o['modules'][i]._typename
        print(i, nn_info[i]['name'])
        for item in sorted(nn_info[i].keys()): 
            print("  {}:{}".format(item, nn_info[i][item] if 'running' not in item \
                                                        else nn_info[i][item].shape))


def get_mylist():
    """ Return manually selected information lists """
    return ['_type', 'nInputPlane', 'nOutputPlane', \
            'input_offset', 'groups', 'dH', 'dW', \
            'padH', 'padW', 'kH', 'kW', 'iSize', \
            'running_mean', 'running_var']


if __name__ == '__main__':
    # File loader
    t7_file = './models/soundnet8_final.t7'
    o = torchfile.load(t7_file)
    
    # To show nn parameter
    show(o)
    
    # To store as npy file
    param_list = []
    load(o, param_list)
    save_list = {}
    for i, k in enumerate(keys):
        save_list[k] = param_list[i]
    np.save('sound8', save_list)


================================================
FILE: main.py
================================================
# TensorFlow version of NIPS2016 soundnet
# Required package: librosa: A python package for music and audio analysis.
# $ pip install librosa

from ops import batch_norm, conv2d, relu, maxpool
from util import preprocess, load_from_list, load_audio
from model import Model
from glob import glob

import tensorflow as tf
import numpy as np
import argparse
import time
import sys
import os


# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

local_config = {
            'batch_size': 1, 
            'train_size': np.inf,
            'epoch': 200,
            'eps': 1e-5,
            'learning_rate': 1e-3,
            'beta1': 0.9,
            'load_size': 22050*4,
            'sample_rate': 22050,
            'name_scope': 'SoundNet',
            'phase': 'train',
            'dataset_name': 'ESC50',
            'subname': 'mp3',
            'checkpoint_dir': 'checkpoint',
            'dump_dir': 'output',
            'model_dir': None,
            'param_g_dir': './models/sound8.npy',
            }


class Model():
    def __init__(self, session, config=local_config, param_G=None):
        self.sess           = session
        self.config         = config
        self.param_G        = param_G
        self.g_step         = tf.Variable(0, trainable=False)
        self.counter        = 0
        self.model()
 

    def model(self):
        # Placeholder
        self.sound_input_placeholder = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel
        self.object_dist = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 1000]) # batch x time x object classes
        self.scene_dist = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 401]) # batch x time x scene classes
        
        # Generator
        self.add_generator(name_scope=self.config['name_scope'])
 
        # KL Divergence
        self.object_loss = self.KL_divergence(self.layers[25], self.object_dist, name_scope='KL_Div_object')
        self.scene_loss = self.KL_divergence(self.layers[26], self.scene_dist, name_scope='KL_Div_scene')
        self.loss = self.object_loss + self.scene_loss

        # Summary
        self.loss_sum = tf.summary.scalar("g_loss", self.loss)
        self.g_sum = tf.summary.merge([self.loss_sum])
        self.writer = tf.summary.FileWriter("./logs", self.sess.graph)
        
        # variable collection
        self.g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, 
                                    scope=self.config['name_scope'])

        self.saver = tf.train.Saver(keep_checkpoint_every_n_hours=12, 
                                    max_to_keep=5, 
                                    restore_sequentially=True)

        # Optimizer and summary
        self.g_optim = tf.train.AdamOptimizer(self.config['learning_rate'], beta1=self.config['beta1']) \
                          .minimize(self.loss, var_list=(self.g_vars), global_step=self.g_step)
        
        # Initialize
        init_op = tf.global_variables_initializer()
        self.sess.run(init_op)
        
        # Load checkpoint
        if self.load(self.config['checkpoint_dir']):
            print(" [*] Load SUCCESS")
        else:
            print(" [!] Load failed...")


    def add_generator(self, name_scope='SoundNet'):
        with tf.variable_scope(name_scope) as scope:
            self.layers = {}
            
            # Stream one: conv1 ~ conv7
            self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1')
            self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1')
            self.layers[3] = relu(self.layers[2], name_scope='conv1')
            self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1')

            self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2')
            self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2')
            self.layers[7] = relu(self.layers[6], name_scope='conv2')
            self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2')

            self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3')
            self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3')
            self.layers[11] = relu(self.layers[10], name_scope='conv3')

            self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4')
            self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4')
            self.layers[14] = relu(self.layers[13], name_scope='conv4')

            self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5')
            self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5')
            self.layers[17] = relu(self.layers[16], name_scope='conv5')
            self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5')

            self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6')
            self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6')
            self.layers[21] = relu(self.layers[20], name_scope='conv6')

            self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7')
            self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7')
            self.layers[24] = relu(self.layers[23], name_scope='conv7')

            # Split one: conv8, conv8_2
            # NOTE: here we use a padding of 2 so that conv8 does not fail with
            # "output size would be negative" on short inputs (see Known issues in README)
            # https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45
            self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, p_h=2, name_scope='conv8')
            self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, p_h=2, name_scope='conv8_2')


    def train(self):
        """Train SoundNet"""

        start_time = time.time()

        # Data info
        data = glob('./data/*.{}'.format(self.config['subname']))
        batch_idxs = min(len(data), self.config['train_size']) // self.config['batch_size']
        for epoch in xrange(self.counter//batch_idxs, self.config['epoch']):

            for idx in xrange(self.counter%batch_idxs, batch_idxs):
        
                # By default, librosa will resample the signal to 22050Hz. And range in (-1., 1.)
                sound_sample = load_from_list(data[idx*self.config['batch_size']:(idx+1)*self.config['batch_size']], self.config)
                
                # Update G network
                # NOTE: Here we still use dummy random distribution for scene and objects
                _, summary_str, l_scn, l_obj = self.sess.run([self.g_optim, self.g_sum, self.scene_loss, self.object_loss],
                    feed_dict={self.sound_input_placeholder: sound_sample, \
                            self.scene_dist: np.random.randint(2, size=(1, 1, 401)), \
                            self.object_dist: np.random.randint(2, size=(1, 1, 1000))})
                self.writer.add_summary(summary_str, self.counter)

                print ("[Epoch {}] {}/{} | Time: {} | scene_loss: {} | obj_loss: {}".format(epoch, idx, batch_idxs, time.time() - start_time, l_scn, l_obj))

                if np.mod(self.counter, 1000) == 1000 - 1:
                    self.save(self.config['checkpoint_dir'], self.counter)

                self.counter += 1


    #########################
    #          Loss         #
    #########################
    # Adapted from the answer here: http://stackoverflow.com/questions/41863814/kl-divergence-in-tensorflow
    def KL_divergence(self, dist_a, dist_b, name_scope='KL_Div'):
        return tf.reduce_mean(-tf.nn.softmax_cross_entropy_with_logits(logits=dist_a, labels=dist_b))


    #########################
    #       Save/Load       #
    #########################
    @property
    def get_model_dir(self):
        if self.config['model_dir'] is None:
            return "{}_{}".format(
                self.config['dataset_name'], self.config['batch_size'])
        else:
            return self.config['model_dir']
    

    def load(self, ckpt_dir='checkpoint'):
        return self.load_from_ckpt(ckpt_dir) if self.param_G is None \
        else self.load_from_npy()


    def save(self, checkpoint_dir, step):
        """ Checkpoint saver """
        model_name = "SoundNet.model"
        checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir)

        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)

        self.saver.save(self.sess,
                        os.path.join(checkpoint_dir, model_name),
                        global_step=step)


    def load_from_ckpt(self, checkpoint_dir='checkpoint'):
        """ Checkpoint loader """
        print(" [*] Reading checkpoints...")

        checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir)

        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
        if ckpt and ckpt.model_checkpoint_path:
            ckpt_name = os.path.basename(ckpt.model_checkpoint_path)
            self.saver.restore(self.sess, os.path.join(checkpoint_dir, ckpt_name))
            print(" [*] Success to read {}".format(ckpt_name))
            self.counter = int(ckpt_name.rsplit('-', 1)[-1])
            print(" [*] Start counter from {}".format(self.counter))
            return True
        else:
            print(" [*] Failed to find a checkpoint under {}".format(checkpoint_dir))
            return False


    def load_from_npy(self):
        if self.param_G is None: return False
        data_dict = self.param_G
        for key in data_dict:
            with tf.variable_scope(self.config['name_scope'] + '/'+ key, reuse=True):
                for subkey in data_dict[key]:
                    try:
                        var = tf.get_variable(subkey)
                        self.sess.run(var.assign(data_dict[key][subkey]))
                        print('Assign pretrain model {} to {}'.format(subkey, key))
                    except:
                        print('Ignore {}'.format(key))
        
        self.param_G.clear()
        return True


def main():

    args = parse_args()
    local_config['phase'] = args.phase

    # Setup visible device
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device

    # Make path
    if not os.path.exists(args.outpath):
        os.mkdir(args.outpath)
    
    # Load pre-trained model
    param_G = np.load(local_config['param_g_dir'], encoding='latin1', allow_pickle=True).item() \
            if args.phase in ['finetune', 'extract'] \
            else None

    # Init. Session
    sess_config = tf.ConfigProto()
    sess_config.allow_soft_placement=True
    sess_config.gpu_options.allow_growth = True
    
    with tf.Session(config=sess_config) as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)
 
        if args.phase in ['train', 'finetune']:
            # Training phase
            model.train()
        elif args.phase == 'extract':
            # import when we need
            from extract_feat import extract_feat

            # Feature extractor
            #sound_sample = np.reshape(np.load('./data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1])
            
            import librosa
            audio_path = './data/demo.mp3'
            sound_sample, _ = load_audio(audio_path)
            sound_sample = preprocess(sound_sample, config=local_config)

            output = extract_feat(model, sound_sample, args)


def parse_args():
    """ Parse input arguments """
    parser = argparse.ArgumentParser(description='SoundNet')
    
    parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output')

    parser.add_argument('-p', '--phase', dest='phase', help='train, finetune, or extract feature. e.g., [train, finetune, extract]', default='finetune')

    parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1)

    parser.add_argument('-x', dest='layer_max', help='end before which feature layer (exclusive). e.g., [24]', type=int, default=None)
    
    parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0')

    feature_parser = parser.add_mutually_exclusive_group(required=False)
    feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. [False(default), True]', action='store_true')
    parser.set_defaults(is_save=False)
    
    args = parser.parse_args()

    return args


if __name__ == '__main__':
    main()


================================================
FILE: model.py
================================================
# TensorFlow version of NIPS2016 soundnet

import sys
import numpy as np
import tensorflow as tf
from ops import batch_norm, conv2d, relu, maxpool

# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

local_config = {  
            'batch_size': 1, 
            'eps': 1e-5,
            'name_scope': 'SoundNet',
            }

class Model():
    def __init__(self, session, config=local_config, param_G=None):
        # Print config
        for key in config: print("{}:{}".format(key, config[key]))

        self.sess           = session
        self.config         = config
        self.param_G        = param_G
        
        # Placeholder
        self.add_placeholders()
        
        # Generator
        self.add_generator(name_scope=self.config['name_scope'])


    def add_placeholders(self):
        self.sound_input_placeholder = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel


    def add_generator(self, name_scope='SoundNet'):
        with tf.variable_scope(name_scope) as scope:
            self.layers = {}
            
            # Stream one: conv1 ~ conv7
            self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1')
            self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1')
            self.layers[3] = relu(self.layers[2], name_scope='conv1')
            self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1')

            self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2')
            self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2')
            self.layers[7] = relu(self.layers[6], name_scope='conv2')
            self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2')

            self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3')
            self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3')
            self.layers[11] = relu(self.layers[10], name_scope='conv3')

            self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4')
            self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4')
            self.layers[14] = relu(self.layers[13], name_scope='conv4')

            self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5')
            self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5')
            self.layers[17] = relu(self.layers[16], name_scope='conv5')
            self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5')

            self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6')
            self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6')
            self.layers[21] = relu(self.layers[20], name_scope='conv6')

            self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7')
            self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7')
            self.layers[24] = relu(self.layers[23], name_scope='conv7')

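            # Two output heads on top of conv7: conv8 (1000 channels) and conv8_2 (401 channels);
            # in SoundNet these correspond to the object and scene category distributions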
            self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, name_scope='conv8')
            self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, name_scope='conv8_2')


    def load(self):
        if self.param_G is None: return False
        data_dict = self.param_G
        for key in data_dict:
            with tf.variable_scope(self.config['name_scope'] + '/' + key, reuse=True):
                for subkey in data_dict[key]:
                    try:
                        var = tf.get_variable(subkey)
                        self.sess.run(var.assign(data_dict[key][subkey]))
                        print('Assign pretrain model {} to {}'.format(subkey, key))
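                    except Exception:
                        # Typically the variable is missing from this scope or its shape
                        # does not match; skip it but report which entry was ignored
                        print('Ignore {} in {}'.format(subkey, key))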
        self.param_G.clear()
        return True
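

# NOTE: Illustrative sketch, not part of the original SoundNet port.
# Model.load() expects param_G to be a nested dict of the form
# {layer_name: {variable_name: ndarray}}, with names matching the
# tf.get_variable() calls in ops.py. The helper below (a hypothetical name)
# lists the shapes such a dict would need, assuming the layer sizes used in
# add_generator(). It may be handy when checking a converted checkpoint.
def expected_param_shapes():
    # (name, in_ch, out_ch, k_h, has_batch_norm), copied from add_generator()
    spec = [
        ('conv1', 1, 16, 64, True),
        ('conv2', 16, 32, 32, True),
        ('conv3', 32, 64, 16, True),
        ('conv4', 64, 128, 8, True),
        ('conv5', 128, 256, 4, True),
        ('conv6', 256, 512, 4, True),
        ('conv7', 512, 1024, 4, True),
        ('conv8', 1024, 1000, 8, False),
        ('conv8_2', 1024, 401, 8, False),
    ]
    shapes = {}
    for name, in_ch, out_ch, k_h, has_bn in spec:
        # conv2d() creates weights as [k_h, k_w, in_ch, out_ch] with k_w=1
        layer = {'weights': (k_h, 1, in_ch, out_ch), 'biases': (out_ch,)}
        if has_bn:
            # batch_norm() creates four per-channel vectors
            for stat in ('mean', 'var', 'gamma', 'beta'):
                layer[stat] = (out_ch,)
        shapes[name] = layer
    return shapes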


if __name__ == '__main__':
    
    layer_min = int(sys.argv[1])
    layer_max = int(sys.argv[2]) if len(sys.argv) > 2 else layer_min + 1
    
    # Load pre-trained model
    G_name = './models/sound8.npy'
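    # The .npy file stores a pickled dict; newer NumPy releases require allow_pickle=True
    param_G = np.load(G_name, encoding='latin1', allow_pickle=True).item()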
    dump_path = './output/'

    with tf.Session() as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)
        init = tf.global_variables_initializer()
        session.run(init)
        
        model.load()
        
        # Demo
        sound_input = np.reshape(np.load('data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1])
        feed_dict = {model.sound_input_placeholder: sound_input}
        
        # Forward
        for idx in range(layer_min, layer_max):
            feature = session.run(model.layers[idx], feed_dict=feed_dict)
            np.save(dump_path + 'tf_fea{}.npy'.format(str(idx).zfill(2)), np.squeeze(feature))
            print("Save layer {} with shape {} as {}tf_fea{}.npy".format(idx, np.squeeze(feature).shape, dump_path, str(idx).zfill(2)))


================================================
FILE: ops.py
================================================
# TensorFlow version of NIPS2016 soundnet
import tensorflow as tf

def conv2d(prev_layer, in_ch, out_ch, k_h=1, k_w=1, d_h=1, d_w=1, p_h=0, p_w=0, pad='VALID', name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        # h x w x input_channel x output_channel
        w_conv = tf.get_variable('weights', [k_h, k_w, in_ch, out_ch], 
                initializer=tf.truncated_normal_initializer(0.0, stddev=0.01))
        b_conv = tf.get_variable('biases', [out_ch], 
                initializer=tf.constant_initializer(0.0))
        
        padded_input = tf.pad(prev_layer, [[0, 0], [p_h, p_h], [p_w, p_w], [0, 0]], "CONSTANT") if pad == 'VALID' \
                else prev_layer

        output = tf.nn.conv2d(padded_input, w_conv, 
                [1, d_h, d_w, 1], padding=pad, name='z') + b_conv
    
        return output


def batch_norm(prev_layer, out_ch, eps, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
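        # Inference-mode batch norm: mean/var are stored variables (filled from the
        # pretrained model by Model.load()), not statistics of the current batch.
        #mu_conv, var_conv = tf.nn.moments(prev_layer, [0, 1, 2], keep_dims=False)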
        mu_conv = tf.get_variable('mean', [out_ch], 
            initializer=tf.constant_initializer(0))
        var_conv = tf.get_variable('var', [out_ch], 
            initializer=tf.constant_initializer(1))
        gamma_conv = tf.get_variable('gamma', [out_ch], 
            initializer=tf.constant_initializer(1))
        beta_conv = tf.get_variable('beta', [out_ch], 
            initializer=tf.constant_initializer(0))
        output = tf.nn.batch_normalization(prev_layer, mu_conv, 
            var_conv, beta_conv, gamma_conv, eps, name='batch_norm')
        
        return output


def relu(prev_layer, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        return tf.nn.relu(prev_layer, name='a')


def maxpool(prev_layer, k_h=1, k_w=1, d_h=1, d_w=1, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        return tf.nn.max_pool(prev_layer, 
                [1, k_h, k_w, 1], [1, d_h, d_w, 1], padding='VALID', name='maxpool')
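

# NOTE: Illustrative sketch, not part of the original ops.
# conv2d() above zero-pads p_h samples on each side and then runs a VALID
# convolution, and maxpool() is VALID as well, so the length along the time
# axis after one op is floor((n + 2*p - k) / d) + 1. The helper names and the
# SOUNDNET_TIME_SPEC table are assumptions introduced here; the numbers are
# copied from Model.add_generator().
def valid_output_length(n, k, d, p=0):
    # n: input length, k: kernel size, d: stride, p: zero padding per side
    return (n + 2 * p - k) // d + 1


# (name, kernel, stride, padding) along the time axis, conv1 through conv8
SOUNDNET_TIME_SPEC = [
    ('conv1', 64, 2, 32), ('pool1', 8, 8, 0),
    ('conv2', 32, 2, 16), ('pool2', 8, 8, 0),
    ('conv3', 16, 2, 8),
    ('conv4', 8, 2, 4),
    ('conv5', 4, 2, 2), ('pool5', 4, 4, 0),
    ('conv6', 4, 2, 2),
    ('conv7', 4, 2, 2),
    ('conv8', 8, 2, 0),
]


def soundnet_output_lengths(n):
    # Returns [(op_name, length_after_op), ...] for an input of n samples
    lengths = []
    for name, k, d, p in SOUNDNET_TIME_SPEC:
        n = valid_output_length(n, k, d, p)
        lengths.append((name, n))
    return lengths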


================================================
FILE: util.py
================================================
import numpy as np
import librosa
import pdb

local_config = {
            'batch_size': 64, 
            'load_size': 22050*20,
            'phase': 'extract'
            }


def load_from_list(name_list, config=local_config):
    assert len(name_list) == config['batch_size'], \
            "The length of name_list({})[{}] is not the same as batch_size[{}]".format(
                    name_list[0], len(name_list), config['batch_size'])
    audios = np.zeros([config['batch_size'], config['load_size'], 1, 1])
    for idx, audio_path in enumerate(name_list):
        sound_sample, _ = load_audio(audio_path)
        audios[idx] = preprocess(sound_sample, config)
        
    return audios


def load_from_txt(txt_name, config=local_config):
    with open(txt_name, 'r') as handle:
        txt_list = handle.read().splitlines()

    audios = []
    for idx, audio_path in enumerate(txt_list):
        sound_sample, _ = load_audio(audio_path)
        audios.append(preprocess(sound_sample, config))
        
    return audios


# NOTE: Load an audio as the same format in soundnet
# 1. Keep original sample rate (which conflicts with their own paper)
# 2. Use first channel in multiple channels
# 3. Keep range in [-256, 256]

def load_audio(audio_path, sr=None):
    # With sr=None librosa keeps the file's native sample rate; its own default
    # (sr=22050) would resample to 22050 Hz. Samples are floats in [-1., 1.]
    sound_sample, sr = librosa.load(audio_path, sr=sr, mono=False)

    return sound_sample, sr


def preprocess(raw_audio, config=local_config):
    # Select first channel (mono)
    if len(raw_audio.shape) > 1:
        raw_audio = raw_audio[0]

    # Make range [-256, 256] (not in place, so the caller's array is left untouched)
    raw_audio = raw_audio * 256.0

    # Make minimum length available
    length = config['load_size']
    if length > raw_audio.shape[0]:
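        # Integer division: np.tile needs an integer repeat count (true division yields a float on Python 3)
        raw_audio = np.tile(raw_audio, length // raw_audio.shape[0] + 1)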

    # Make equal training length
    if config['phase'] != 'extract':
        raw_audio = raw_audio[:length]

    # Check conditions
    assert len(raw_audio.shape) == 1, "It seems this audio contains two channels, we only need the first channel"
    assert np.max(raw_audio) <= 256, "It seems this audio contains a signal that exceeds 256"
    assert np.min(raw_audio) >= -256, "It seems this audio contains a signal that falls below -256"

    # Shape to 1 x DIM x 1 x 1
    raw_audio = np.reshape(raw_audio, [1, -1, 1, 1])

    return raw_audio.copy()
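

# NOTE: Illustrative sketch, not part of the original utilities.
# preprocess() repeats clips shorter than load_size with np.tile. The
# hypothetical helper below isolates the repeat-count arithmetic so it can be
# checked without loading audio: the count must be an integer, and
# count * clip_len must be at least the requested length.
def repeats_needed(clip_len, length):
    # Clips that are already long enough are left as a single copy
    if length > clip_len:
        return length // clip_len + 1
    return 1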