Repository: eborboihuc/SoundNet-tensorflow
Branch: master
Commit: b603cd4584a9
Files: 12
Total size: 36.7 KB
Directory structure:
gitextract_78ltv8rb/
├── .gitignore
├── LICENSE
├── README.md
├── cmp.py
├── demo.txt
├── extract_feat.py
├── h5convert.py
├── load_t7.py
├── main.py
├── model.py
├── ops.py
└── util.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitignore
================================================
# Data
data/*
*.zip
output
models/*
# checkpoint
*logs
*checkpoint
# trash
.dropbox
# Created by https://www.gitignore.io/api/python,vim
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
env/
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
*.egg-info/
.installed.cfg
*.egg
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
.hypothesis/
# Translations
*.mo
*.pot
# Django stuff:
*.log
# Sphinx documentation
docs/_build/
# PyBuilder
target/
### Vim ###
[._]*.s[a-w][a-z]
[._]s[a-w][a-z]
*.un~
Session.vim
.netrwhist
*~
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2018 Hou-Ning Hu
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
# SoundNet-tensorflow
TensorFlow implementation of "SoundNet" that learns rich natural sound representations.
Code for paper "[SoundNet: Learning Sound Representations from Unlabeled Video](https://arxiv.org/abs/1610.09001)" by Yusuf Aytar, Carl Vondrick, Antonio Torralba. NIPS 2016

# Prerequisites
- Linux
- NVIDIA GPU + CUDA 8.0 + CuDNNv5.1
- Python 2.7 with numpy or Python 3.5
- [Tensorflow](https://www.tensorflow.org/) 1.0.0 (up to 1.3.0)
- librosa
# Getting Started
- Clone this repo:
```bash
git clone git@github.com:eborboihuc/SoundNet-tensorflow.git
cd SoundNet-tensorflow
```
- Pretrained Model
I provide pre-trained models that are ported from [soundnet](http://data.csail.mit.edu/soundnet/soundnet_models_public.zip). You can download the 8 layer model [here](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjR015M1RLZW45OEU). Please place it as `./models/sound8.npy` in your folder.
- Data
Prepare your input mp3 files and place them under `./data/`
Generate an input file list in `txt` format and place it under `./`
```txt
./data/0001.mp3
./data/0002.mp3
./data/0003.mp3
...
```
Follow the steps in [extract features](#feature-extraction)
- NOTE
If you find that [an audio file with a `start` offset value in FFMPEG causes a large difference between `torch audio` and `librosa`](#FAQs), please **convert it** with the following command.
```
sox {input.mp3} {output.mp3} trim 0
```
After this, the result might be much better.
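The file list described in the Data step can be generated rather than typed by hand. The following is a minimal sketch, not part of the repo: `write_file_list` and `my_list.txt` are hypothetical names, and it assumes every `.mp3` directly under the data directory should be listed.

```python
import glob
import os
import tempfile


def write_file_list(data_dir, txt_path, ext='mp3'):
    """Write one audio path per line, sorted for a reproducible order."""
    paths = sorted(glob.glob(os.path.join(data_dir, '*.' + ext)))
    with open(txt_path, 'w') as handle:
        for path in paths:
            handle.write(path + '\n')
    return paths


# Demo on a throwaway directory so the sketch runs without real audio.
demo_dir = tempfile.mkdtemp()
for name in ('0002.mp3', '0001.mp3', 'notes.txt'):
    open(os.path.join(demo_dir, name), 'w').close()
listed = write_file_list(demo_dir, os.path.join(demo_dir, 'my_list.txt'))
print([os.path.basename(p) for p in listed])
```

For real data, something like `write_file_list('./data', 'my_list.txt')` would produce a list that can be passed to `extract_feat.py` with `-t my_list.txt`.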
# Demo
For the demo, follow these steps:
i) Download the converted npy file [demo.npy](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjcEtqQ3VIM1pvZ3c) and place it under `./data/`
ii) Extract multiple features from the pretrained model using the sound track that was loaded with torch `lua audio`, so the input is equivalent to the one used by the torch version:
```bash
python extract_feat.py -m {start layer number} -x {end layer number} -s
```
Then you can compare the outputs with the torch ones (`cmp.py` does this for a single layer).
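The comparison that `cmp.py` performs can be sketched on in-memory arrays as follows. This is an illustration under assumptions: `compare_features` is a hypothetical helper, and the two arrays are synthetic stand-ins for a TensorFlow and a Torch feature map.

```python
import numpy as np


def compare_features(tf_feat, th_feat, decimals=4):
    """Round both arrays, trim to the shorter first axis, and summarise the gap."""
    size = min(tf_feat.shape[0], th_feat.shape[0])
    diff = np.round(tf_feat[:size], decimals) - np.round(th_feat[:size], decimals)
    return {
        'total_abs': float(np.sum(np.abs(diff))),
        'max': float(np.max(diff)),
        'min': float(np.min(diff)),
    }


# Synthetic stand-ins; real inputs would come from np.load on the saved features.
tf_feat = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
th_feat = np.array([[1.0, 2.5], [3.0, 3.0]])
stats = compare_features(tf_feat, th_feat)
print(stats)
```

Trimming to the shorter first axis mirrors `cmp.py`, which tolerates the two loaders producing slightly different lengths.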
# Feature Extraction
## Minimum example
i) Download input file [demo.mp3](https://drive.google.com/uc?export=download&id=0B9wE6h4m--wjTjVEWVI3dnBsTG8) and place it under `./data/`
ii) Prepare a file list in `txt` format (`demo.txt`) that includes the input mp3 file(s) and place it under `./`
```txt
./data/demo.mp3
```
iii) Then extract features from the raw waveforms listed in `demo.txt`:
```bash
python extract_feat.py -m {start layer number} -x {end layer number} -s -p extract -t demo.txt
```
## More options
To extract multiple features from a pretrained model with downloaded mp3 dataset:
```bash
python extract_feat.py -t {dataset_txt_name} -m {start layer number} -x {end layer number} -s -p extract
```
e.g. extract layers 4 through 16 (the end layer given with `-x` is exclusive) and save them as `./sound_out/tf_fea%02d.npy`:
```bash
python extract_feat.py -o sound_out -m 4 -x 17 -s -p extract
```
More details are in:
```bash
python extract_feat.py -h
```
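In `extract_feat.py` the layer loop runs over `xrange(layer_min, layer_max)`, so the end layer is exclusive, and each saved file is named `tf_fea{NN}.npy` with a zero-padded layer index. The sketch below only reproduces that naming rule; `expected_feature_files` is a hypothetical helper, not a function in the repo.

```python
import os


def expected_feature_files(outpath, layer_min, layer_max=None):
    """List the .npy files extract_feat.py would write when -s is set."""
    if layer_max is None:
        layer_max = layer_min + 1  # same default as extract_feat.py
    return [os.path.join(outpath, 'tf_fea{}.npy'.format(str(idx).zfill(2)))
            for idx in range(layer_min, layer_max)]


files = expected_feature_files('sound_out', 4, 17)
print(files[0], files[-1], len(files))
```

With `-m 4 -x 17` this yields 13 files, from layer 04 up to layer 16.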
# Finetuning
To train from an existing model:
```bash
python main.py
```
# Training
To train from scratch:
```bash
python main.py -p train
```
To extract features:
```bash
python main.py -p extract -m {start layer number} -x {end layer number} -s
```
More details are in:
```bash
python main.py -h
```
# TODOs
- [x] Change audio loader to soundnet format
- [x] Make it compatible to Python 3 format
- [ ] Batch Norm behaviour different from Torch
- [ ] Fix conv8 padding issue in training phase
- [ ] Change all `config` into `tf.app.flags`
- [ ] Change dummy distribution of scene and object to useful placeholder
- [ ] Add sound and feature loader from [Data](https://projects.csail.mit.edu/soundnet/) section
# Known issues
- Loaded audio length is not consistent between `torch7 audio` and `librosa`. Here is the [issue](https://github.com/soumith/lua---audio/issues/17#issuecomment-288648237)
- Training on short audio clips makes conv8 fail with an [output size would be negative](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45) error
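The conv8 failure follows from the layer arithmetic. As a rough sketch, the time-axis length after each `VALID` convolution or pooling is `(length + 2 * pad - kernel) // stride + 1`, using the kernel, stride, and padding values from `add_generator`. The helper name `conv8_output_length` is hypothetical.

```python
# (name, kernel, stride, padding) along the time axis, copied from add_generator
LAYERS = [
    ('conv1', 64, 2, 32), ('pool1', 8, 8, 0),
    ('conv2', 32, 2, 16), ('pool2', 8, 8, 0),
    ('conv3', 16, 2, 8),
    ('conv4', 8, 2, 4),
    ('conv5', 4, 2, 2), ('pool5', 4, 4, 0),
    ('conv6', 4, 2, 2),
    ('conv7', 4, 2, 2),
]


def conv8_output_length(num_samples, conv8_pad=0):
    """Time-axis length after conv8; a value below 1 means the graph cannot run."""
    length = num_samples
    for _, kernel, stride, pad in LAYERS + [('conv8', 8, 2, conv8_pad)]:
        length = (length + 2 * pad - kernel) // stride + 1
        if length < 1:
            break  # a later layer cannot recover from an empty feature map
    return length


print(conv8_output_length(22050 * 4))               # training load_size, no padding
print(conv8_output_length(22050 * 4, conv8_pad=2))  # padding used in main.py
print(conv8_output_length(22050 * 20))              # extraction load_size
```

Under these assumptions the 4-second training length leaves only 4 time steps before conv8, fewer than its kernel of 8, which is consistent with `main.py` padding conv8 by 2 while `model.py` does not.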
# FAQs
- Why does my loaded sound wave differ between `torch7 audio` and `librosa`? Here is my [Wiki](https://github.com/eborboihuc/SoundNet-tensorflow/wiki/info.md)
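For reference when debugging such differences, the scaling and length handling that `util.py` applies after loading can be sketched with NumPy alone. This mirrors `preprocess` under assumptions: `preprocess_sketch` is a hypothetical name, it takes `load_size` and `phase` as plain arguments instead of a config dict, and the input clip is synthetic.

```python
import numpy as np


def preprocess_sketch(raw_audio, load_size, phase='extract'):
    """First channel only, scale [-1, 1] to [-256, 256], tile short clips, reshape."""
    audio = np.asarray(raw_audio, dtype=np.float32)
    if audio.ndim > 1:
        audio = audio[0]                      # keep the first channel
    audio = audio * 256.0
    if load_size > audio.shape[0]:
        # Repeat the clip until it is at least load_size samples long
        audio = np.tile(audio, load_size // audio.shape[0] + 1)
    if phase != 'extract':
        audio = audio[:load_size]             # fixed length for training
    return audio.reshape(1, -1, 1, 1)         # batch x time x 1 x 1


clip = np.array([[0.5, -0.25, 0.125], [0.0, 0.0, 0.0]])  # 2 channels, 3 samples
extract_input = preprocess_sketch(clip, load_size=10)
train_input = preprocess_sketch(clip, load_size=10, phase='train')
print(extract_input.shape, train_input.shape)
```

In the extract phase the tiled clip is not truncated, so its length is a multiple of the original clip length rather than exactly `load_size`.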
# Acknowledgments
Code ported from [soundnet](https://github.com/cvondrick/soundnet). The Torch7-to-TensorFlow loader is from [tf_videogan](https://github.com/Yuliang-Zou/tf_videogan). Thanks for their excellent work!
## Author
Hou-Ning Hu / [@eborboihuc](https://eborboihuc.github.io/)
================================================
FILE: cmp.py
================================================
import numpy as np
import sys

# Usage: python cmp.py {layer number} [{decimals}]
name = sys.argv[1]
dec = int(sys.argv[2]) if len(sys.argv) >= 3 else 4

# demo_th.npy stores a pickled dict, so NumPy >= 1.16.3 needs allow_pickle=True
th = np.load('output/demo_th.npy', encoding='latin1', allow_pickle=True).item()['layer{}'.format(name)].T
tf = np.load('output/tf_fea{}.npy'.format(str(name).zfill(2)), encoding='latin1')
if name == '25':
    # Layers 25 and 26 (conv8, conv8_2) are concatenated along axis 1
    tf = np.concatenate([tf, np.load('output/tf_fea26.npy', encoding='latin1')], 1)

print('Layer {}: tf.shape={}, th.shape={}'.format(name, tf.shape, th.shape))
print('TF:')
print(tf)
print('Torch:')
print(th)

# Compare only the overlapping part along the first axis
size = tf.shape[0] if tf.shape[0] < th.shape[0] else th.shape[0]
print('Round to {} decimals'.format(dec))
tf = np.round(tf, decimals=dec)
th = np.round(th, decimals=dec)
print('Total Diff: {} Max Diff: {} Min Diff: {}'.format(
    np.sum(abs(tf[:size] - th[:size])),
    np.max(tf[:size] - th[:size]),
    np.min(tf[:size] - th[:size])))
================================================
FILE: demo.txt
================================================
data/demo.mp3
================================================
FILE: extract_feat.py
================================================
# TensorFlow version of NIPS2016 soundnet
from util import load_from_txt
from model import Model
import tensorflow as tf
import numpy as np
import argparse
import sys
import os
# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

local_config = {
    'batch_size': 1,
    'eps': 1e-5,
    'sample_rate': 22050,
    'load_size': 22050*20,
    'name_scope': 'SoundNet',
    'phase': 'extract',
}

def parse_args():
    """ Parse input arguments """
    parser = argparse.ArgumentParser(description='Extract Feature')

    parser.add_argument('-t', '--txt', dest='audio_txt', help='target audio txt path. e.g., [demo.txt]', default='demo.txt')
    parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output')
    parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [demo, extract]', default='demo')
    parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1)
    parser.add_argument('-x', dest='layer_max', help='end at which feature layer. e.g., [24]', type=int, default=None)
    parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0')

    feature_parser = parser.add_mutually_exclusive_group(required=False)
    feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. [False(default), True]', action='store_true')
    parser.set_defaults(is_save=False)

    args = parser.parse_args()
    return args

def extract_feat(model, sound_input, config):
    # layer_max is exclusive; with no -x given, only layer_min is extracted
    layer_min = config.layer_min
    layer_max = config.layer_max if config.layer_max is not None else layer_min + 1

    # Extract feature
    features = {}
    feed_dict = {model.sound_input_placeholder: sound_input}

    for idx in xrange(layer_min, layer_max):
        feature = model.sess.run(model.layers[idx], feed_dict=feed_dict)
        features[idx] = feature
        if config.is_save:
            np.save(os.path.join(config.outpath, 'tf_fea{}.npy'.format(
                str(idx).zfill(2))), np.squeeze(feature))
            print("Save layer {} with shape {} as {}/tf_fea{}.npy".format(
                idx, np.squeeze(feature).shape, config.outpath, str(idx).zfill(2)))

    return features

if __name__ == '__main__':

    args = parse_args()

    # Setup visible device
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device

    # Load pre-trained model
    # sound8.npy stores a pickled dict, so NumPy >= 1.16.3 needs allow_pickle=True
    G_name = './models/sound8.npy'
    param_G = np.load(G_name, encoding='latin1', allow_pickle=True).item()

    if args.phase == 'demo':
        # Demo
        sound_samples = [np.reshape(np.load('data/demo.npy', encoding='latin1'), [1, -1, 1, 1])]
    else:
        # Extract Feature
        sound_samples = load_from_txt(args.audio_txt, config=local_config)

    # Make path
    if not os.path.exists(args.outpath):
        os.mkdir(args.outpath)

    # Init. Session
    sess_config = tf.ConfigProto()
    sess_config.allow_soft_placement = True
    sess_config.gpu_options.allow_growth = True

    with tf.Session(config=sess_config) as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)
        init = tf.global_variables_initializer()
        session.run(init)

        model.load()

        for sound_sample in sound_samples:
            output = extract_feat(model, sound_sample, args)
================================================
FILE: h5convert.py
================================================
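import numpy as np
import h5py
import sys

# Usage: python h5convert.py {input.h5} {output.npy}
th = h5py.File(sys.argv[1], 'r')

# list() keeps this working on Python 3, where keys() returns a view
keys = list(th.keys())
print(keys)

if len(keys) <= 1:
    # Single dataset: save it as a plain array
    npy = np.array(th[keys[0]])
else:
    # Several datasets: save a dict of arrays
    npy = {}
    for key in keys:
        npy[key] = np.array(th[key])

np.save(sys.argv[2], npy)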
================================================
FILE: load_t7.py
================================================
# Load t7 files
# Required package: torchfile.
# $ pip install torchfile
import torchfile
import numpy as np
import pdb
# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

keys = ['conv1', 'conv2', 'conv3', 'conv4', 'conv5', 'conv6',
        'conv7', 'conv8', 'conv8_2']

def load(o, param_list):
    """ Get torch7 weights into numpy array """
    try:
        num = len(o['modules'])
    except:
        num = 0

    for i in xrange(num):
        # 2D conv
        if o['modules'][i]._typename == 'nn.SpatialConvolution' or \
                o['modules'][i]._typename == 'cudnn.SpatialConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # 2D deconv
        elif o['modules'][i]._typename == 'nn.SpatialFullConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # 3D conv
        elif o['modules'][i]._typename == 'nn.VolumetricFullConvolution':
            temp = {'weights': o['modules'][i]['weight'].transpose((2,3,4,1,0)),
                    'biases': o['modules'][i]['bias']}
            param_list.append(temp)
        # batch norm
        elif o['modules'][i]._typename == 'nn.SpatialBatchNormalization' or \
                o['modules'][i]._typename == 'nn.VolumetricBatchNormalization':
            param_list[-1]['gamma'] = o['modules'][i]['weight']
            param_list[-1]['beta'] = o['modules'][i]['bias']
            param_list[-1]['mean'] = o['modules'][i]['running_mean']
            param_list[-1]['var'] = o['modules'][i]['running_var']

        # Recurse into nested containers
        load(o['modules'][i], param_list)

def show(o):
    """ Show nn information """
    nn = {}
    nn_keys = {}
    nn_info = {}
    num = len(o['modules']) if o['modules'] else 0
    mylist = get_mylist()

    for i in xrange(num):
        # Get _obj and keys from torchfile
        nn[i] = o['modules'][i]._obj
        nn_keys[i] = o['modules'][i]._obj.keys()

        # Get information from _obj
        # {layer i: {mylist keys: value}}
        nn_info[i] = {key: nn[i][key] for key in sorted(nn_keys[i]) if key in mylist}
        nn_info[i]['name'] = o['modules'][i]._typename
        print(i, nn_info[i]['name'])
        for item in sorted(nn_info[i].keys()):
            print(" {}:{}".format(item, nn_info[i][item] if 'running' not in item
                                  else nn_info[i][item].shape))

def get_mylist():
    """ Return manually selected information lists """
    return ['_type', 'nInputPlane', 'nOutputPlane',
            'input_offset', 'groups', 'dH', 'dW',
            'padH', 'padW', 'kH', 'kW', 'iSize',
            'running_mean', 'running_var']

if __name__ == '__main__':
    # File loader
    t7_file = './models/soundnet8_final.t7'
    o = torchfile.load(t7_file)

    # To show nn parameter
    show(o)

    # To store as npy file
    param_list = []
    load(o, param_list)
    save_list = {}
    for i, k in enumerate(keys):
        save_list[k] = param_list[i]
    np.save('sound8', save_list)
================================================
FILE: main.py
================================================
# TensorFlow version of NIPS2016 soundnet
# Required package: librosa: A python package for music and audio analysis.
# $ pip install librosa
from ops import batch_norm, conv2d, relu, maxpool
from util import preprocess, load_from_list, load_audio
from model import Model
from glob import glob
import tensorflow as tf
import numpy as np
import argparse
import time
import sys
import os
# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

local_config = {
    'batch_size': 1,
    'train_size': np.inf,
    'epoch': 200,
    'eps': 1e-5,
    'learning_rate': 1e-3,
    'beta1': 0.9,
    'load_size': 22050*4,
    'sample_rate': 22050,
    'name_scope': 'SoundNet',
    'phase': 'train',
    'dataset_name': 'ESC50',
    'subname': 'mp3',
    'checkpoint_dir': 'checkpoint',
    'dump_dir': 'output',
    'model_dir': None,
    'param_g_dir': './models/sound8.npy',
}

class Model():
    def __init__(self, session, config=local_config, param_G=None):
        self.sess = session
        self.config = config
        self.param_G = param_G
        self.g_step = tf.Variable(0, trainable=False)
        self.counter = 0
        self.model()

    def model(self):
        # Placeholder
        self.sound_input_placeholder = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel
        self.object_dist = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 1000]) # batch x h x w x channel
        self.scene_dist = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 401]) # batch x h x w x channel

        # Generator
        self.add_generator(name_scope=self.config['name_scope'])

        # KL Divergence
        self.object_loss = self.KL_divergence(self.layers[25], self.object_dist, name_scope='KL_Div_object')
        self.scene_loss = self.KL_divergence(self.layers[26], self.scene_dist, name_scope='KL_Div_scene')
        self.loss = self.object_loss + self.scene_loss

        # Summary
        self.loss_sum = tf.summary.scalar("g_loss", self.loss)
        self.g_sum = tf.summary.merge([self.loss_sum])
        self.writer = tf.summary.FileWriter("./logs", self.sess.graph)

        # variable collection
        self.g_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES,
                scope=self.config['name_scope'])
        self.saver = tf.train.Saver(keep_checkpoint_every_n_hours=12,
                max_to_keep=5,
                restore_sequentially=True)

        # Optimizer and summary
        self.g_optim = tf.train.AdamOptimizer(self.config['learning_rate'], beta1=self.config['beta1']) \
                .minimize(self.loss, var_list=(self.g_vars), global_step=self.g_step)

        # Initialize
        init_op = tf.global_variables_initializer()
        self.sess.run(init_op)

        # Load checkpoint
        if self.load(self.config['checkpoint_dir']):
            print(" [*] Load SUCCESS")
        else:
            print(" [!] Load failed...")

    def add_generator(self, name_scope='SoundNet'):
        with tf.variable_scope(name_scope) as scope:
            self.layers = {}

            # Stream one: conv1 ~ conv7
            self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1')
            self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1')
            self.layers[3] = relu(self.layers[2], name_scope='conv1')
            self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1')

            self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2')
            self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2')
            self.layers[7] = relu(self.layers[6], name_scope='conv2')
            self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2')

            self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3')
            self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3')
            self.layers[11] = relu(self.layers[10], name_scope='conv3')

            self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4')
            self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4')
            self.layers[14] = relu(self.layers[13], name_scope='conv4')

            self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5')
            self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5')
            self.layers[17] = relu(self.layers[16], name_scope='conv5')
            self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5')

            self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6')
            self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6')
            self.layers[21] = relu(self.layers[20], name_scope='conv6')

            self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7')
            self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7')
            self.layers[24] = relu(self.layers[23], name_scope='conv7')

            # Split one: conv8, conv8_2
            # NOTE: here we use a padding of 2 to skip an unknown error
            # https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/common_shape_fns.cc#L45
            self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, p_h=2, name_scope='conv8')
            self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, p_h=2, name_scope='conv8_2')

    def train(self):
        """Train SoundNet"""

        start_time = time.time()

        # Data info
        data = glob('./data/*.{}'.format(self.config['subname']))
        batch_idxs = min(len(data), self.config['train_size']) // self.config['batch_size']

        for epoch in xrange(self.counter//batch_idxs, self.config['epoch']):

            for idx in xrange(self.counter%batch_idxs, batch_idxs):

                # By default, librosa will resample the signal to 22050Hz. And range in (-1., 1.)
                sound_sample = load_from_list(data[idx*self.config['batch_size']:(idx+1)*self.config['batch_size']], self.config)

                # Update G network
                # NOTE: Here we still use dummy random distribution for scene and objects
                _, summary_str, l_scn, l_obj = self.sess.run([self.g_optim, self.g_sum, self.scene_loss, self.object_loss],
                    feed_dict={self.sound_input_placeholder: sound_sample,
                               self.scene_dist: np.random.randint(2, size=(1, 1, 401)),
                               self.object_dist: np.random.randint(2, size=(1, 1, 1000))})
                self.writer.add_summary(summary_str, self.counter)

                print("[Epoch {}] {}/{} | Time: {} | scene_loss: {} | obj_loss: {}".format(epoch, idx, batch_idxs, time.time() - start_time, l_scn, l_obj))

                if np.mod(self.counter, 1000) == 1000 - 1:
                    self.save(self.config['checkpoint_dir'], self.counter)

                self.counter += 1

    #########################
    #          Loss         #
    #########################
    # Adapt the answer here: http://stackoverflow.com/questions/41863814/kl-divergence-in-tensorflow
    # NOTE: this returns the mean of the negated softmax cross-entropy between
    # the logits dist_a and the target distribution dist_b, not a full KL divergence
    def KL_divergence(self, dist_a, dist_b, name_scope='KL_Div'):
        return tf.reduce_mean(-tf.nn.softmax_cross_entropy_with_logits(logits=dist_a, labels=dist_b))

    #########################
    #       Save/Load       #
    #########################
    @property
    def get_model_dir(self):
        if self.config['model_dir'] is None:
            return "{}_{}".format(
                self.config['dataset_name'], self.config['batch_size'])
        else:
            return self.config['model_dir']

    def load(self, ckpt_dir='checkpoint'):
        return self.load_from_ckpt(ckpt_dir) if self.param_G is None \
            else self.load_from_npy()

    def save(self, checkpoint_dir, step):
        """ Checkpoint saver """
        model_name = "SoundNet.model"
        checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir)

        if not os.path.exists(checkpoint_dir):
            os.makedirs(checkpoint_dir)

        self.saver.save(self.sess,
                os.path.join(checkpoint_dir, model_name),
                global_step=step)

    def load_from_ckpt(self, checkpoint_dir='checkpoint'):
        """ Checkpoint loader """
        print(" [*] Reading checkpoints...")

        checkpoint_dir = os.path.join(checkpoint_dir, self.get_model_dir)

        ckpt = tf.train.get_checkpoint_state(checkpoint_dir)
        if ckpt and ckpt.model_checkpoint_path:
            ckpt_name = os.path.basename(ckpt.model_checkpoint_path)
            self.saver.restore(self.sess, os.path.join(checkpoint_dir, ckpt_name))
            print(" [*] Success to read {}".format(ckpt_name))
            self.counter = int(ckpt_name.rsplit('-', 1)[-1])
            print(" [*] Start counter from {}".format(self.counter))
            return True
        else:
            print(" [*] Failed to find a checkpoint under {}".format(checkpoint_dir))
            return False

    def load_from_npy(self):
        if self.param_G is None: return False
        data_dict = self.param_G
        for key in data_dict:
            with tf.variable_scope(self.config['name_scope'] + '/'+ key, reuse=True):
                for subkey in data_dict[key]:
                    try:
                        var = tf.get_variable(subkey)
                        self.sess.run(var.assign(data_dict[key][subkey]))
                        print('Assign pretrain model {} to {}'.format(subkey, key))
                    except:
                        print('Ignore {}'.format(key))

        self.param_G.clear()
        return True

def main():

    args = parse_args()
    local_config['phase'] = args.phase

    # Setup visible device
    os.environ["CUDA_VISIBLE_DEVICES"] = args.cuda_device

    # Make path
    if not os.path.exists(args.outpath):
        os.mkdir(args.outpath)

    # Load pre-trained model
    # sound8.npy stores a pickled dict, so NumPy >= 1.16.3 needs allow_pickle=True
    param_G = np.load(local_config['param_g_dir'], encoding='latin1', allow_pickle=True).item() \
            if args.phase in ['finetune', 'extract'] \
            else None

    # Init. Session
    sess_config = tf.ConfigProto()
    sess_config.allow_soft_placement = True
    sess_config.gpu_options.allow_growth = True

    with tf.Session(config=sess_config) as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)

        if args.phase in ['train', 'finetune']:
            # Training phase
            model.train()
        elif args.phase == 'extract':
            # import when we need
            from extract_feat import extract_feat

            # Feature extractor
            #sound_sample = np.reshape(np.load('./data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1])

            import librosa
            audio_path = './data/demo.mp3'
            sound_sample, _ = load_audio(audio_path)
            sound_sample = preprocess(sound_sample, config=local_config)

            output = extract_feat(model, sound_sample, args)

def parse_args():
    """ Parse input arguments """
    parser = argparse.ArgumentParser(description='SoundNet')

    parser.add_argument('-o', '--outpath', dest='outpath', help='output feature path. e.g., [output]', default='output')
    parser.add_argument('-p', '--phase', dest='phase', help='demo or extract feature. e.g., [train, finetune, extract]', default='finetune')
    parser.add_argument('-m', '--layer', dest='layer_min', help='start from which feature layer. e.g., [1]', type=int, default=1)
    parser.add_argument('-x', dest='layer_max', help='end at which feature layer. e.g., [24]', type=int, default=None)
    parser.add_argument('-c', '--cuda', dest='cuda_device', help='which cuda device to use. e.g., [0]', default='0')

    feature_parser = parser.add_mutually_exclusive_group(required=False)
    feature_parser.add_argument('-s', '--save', dest='is_save', help='Turn on save mode. [False(default), True]', action='store_true')
    parser.set_defaults(is_save=False)

    args = parser.parse_args()
    return args

if __name__ == '__main__':
    main()
================================================
FILE: model.py
================================================
# TensorFlow version of NIPS2016 soundnet
import sys
import numpy as np
import tensorflow as tf
from ops import batch_norm, conv2d, relu, maxpool
# Make xrange compatible in both Python 2, 3
try:
    xrange
except NameError:
    xrange = range

local_config = {
    'batch_size': 1,
    'eps': 1e-5,
    'name_scope': 'SoundNet',
}

class Model():
    def __init__(self, session, config=local_config, param_G=None):
        # Print config
        for key in config: print("{}:{}".format(key, config[key]))

        self.sess = session
        self.config = config
        self.param_G = param_G

        # Placeholder
        self.add_placeholders()

        # Generator
        self.add_generator(name_scope=self.config['name_scope'])

    def add_placeholders(self):
        self.sound_input_placeholder = tf.placeholder(tf.float32,
                shape=[self.config['batch_size'], None, 1, 1]) # batch x h x w x channel

    def add_generator(self, name_scope='SoundNet'):
        with tf.variable_scope(name_scope) as scope:
            self.layers = {}

            # Stream one: conv1 ~ conv7
            self.layers[1] = conv2d(self.sound_input_placeholder, 1, 16, k_h=64, d_h=2, p_h=32, name_scope='conv1')
            self.layers[2] = batch_norm(self.layers[1], 16, self.config['eps'], name_scope='conv1')
            self.layers[3] = relu(self.layers[2], name_scope='conv1')
            self.layers[4] = maxpool(self.layers[3], k_h=8, d_h=8, name_scope='conv1')

            self.layers[5] = conv2d(self.layers[4], 16, 32, k_h=32, d_h=2, p_h=16, name_scope='conv2')
            self.layers[6] = batch_norm(self.layers[5], 32, self.config['eps'], name_scope='conv2')
            self.layers[7] = relu(self.layers[6], name_scope='conv2')
            self.layers[8] = maxpool(self.layers[7], k_h=8, d_h=8, name_scope='conv2')

            self.layers[9] = conv2d(self.layers[8], 32, 64, k_h=16, d_h=2, p_h=8, name_scope='conv3')
            self.layers[10] = batch_norm(self.layers[9], 64, self.config['eps'], name_scope='conv3')
            self.layers[11] = relu(self.layers[10], name_scope='conv3')

            self.layers[12] = conv2d(self.layers[11], 64, 128, k_h=8, d_h=2, p_h=4, name_scope='conv4')
            self.layers[13] = batch_norm(self.layers[12], 128, self.config['eps'], name_scope='conv4')
            self.layers[14] = relu(self.layers[13], name_scope='conv4')

            self.layers[15] = conv2d(self.layers[14], 128, 256, k_h=4, d_h=2, p_h=2, name_scope='conv5')
            self.layers[16] = batch_norm(self.layers[15], 256, self.config['eps'], name_scope='conv5')
            self.layers[17] = relu(self.layers[16], name_scope='conv5')
            self.layers[18] = maxpool(self.layers[17], k_h=4, d_h=4, name_scope='conv5')

            self.layers[19] = conv2d(self.layers[18], 256, 512, k_h=4, d_h=2, p_h=2, name_scope='conv6')
            self.layers[20] = batch_norm(self.layers[19], 512, self.config['eps'], name_scope='conv6')
            self.layers[21] = relu(self.layers[20], name_scope='conv6')

            self.layers[22] = conv2d(self.layers[21], 512, 1024, k_h=4, d_h=2, p_h=2, name_scope='conv7')
            self.layers[23] = batch_norm(self.layers[22], 1024, self.config['eps'], name_scope='conv7')
            self.layers[24] = relu(self.layers[23], name_scope='conv7')

            # Split one: conv8, conv8_2
            self.layers[25] = conv2d(self.layers[24], 1024, 1000, k_h=8, d_h=2, name_scope='conv8')
            self.layers[26] = conv2d(self.layers[24], 1024, 401, k_h=8, d_h=2, name_scope='conv8_2')

    def load(self):
        if self.param_G is None: return False
        data_dict = self.param_G
        for key in data_dict:
            with tf.variable_scope(self.config['name_scope'] + '/' + key, reuse=True):
                for subkey in data_dict[key]:
                    try:
                        var = tf.get_variable(subkey)
                        self.sess.run(var.assign(data_dict[key][subkey]))
                        print('Assign pretrain model {} to {}'.format(subkey, key))
                    except:
                        print('Ignore {}'.format(key))

        self.param_G.clear()
        return True

if __name__ == '__main__':

    # Usage: python model.py {start layer number} [{end layer number, exclusive}]
    layer_min = int(sys.argv[1])
    layer_max = int(sys.argv[2]) if len(sys.argv) > 2 else layer_min + 1

    # Load pre-trained model
    # sound8.npy stores a pickled dict, so NumPy >= 1.16.3 needs allow_pickle=True
    G_name = './models/sound8.npy'
    param_G = np.load(G_name, encoding='latin1', allow_pickle=True).item()
    dump_path = './output/'

    with tf.Session() as session:
        # Build model
        model = Model(session, config=local_config, param_G=param_G)
        init = tf.global_variables_initializer()
        session.run(init)

        model.load()

        # Demo
        sound_input = np.reshape(np.load('data/demo.npy', encoding='latin1'), [local_config['batch_size'], -1, 1, 1])
        feed_dict = {model.sound_input_placeholder: sound_input}

        # Forward
        for idx in xrange(layer_min, layer_max):
            feature = session.run(model.layers[idx], feed_dict=feed_dict)
            np.save(dump_path + 'tf_fea{}.npy'.format(str(idx).zfill(2)), np.squeeze(feature))
            print("Save layer {} with shape {} as {}tf_fea{}.npy".format(idx, np.squeeze(feature).shape, dump_path, str(idx).zfill(2)))
================================================
FILE: ops.py
================================================
# TensorFlow version of NIPS2016 soundnet
import tensorflow as tf

def conv2d(prev_layer, in_ch, out_ch, k_h=1, k_w=1, d_h=1, d_w=1, p_h=0, p_w=0, pad='VALID', name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        # h x w x input_channel x output_channel
        w_conv = tf.get_variable('weights', [k_h, k_w, in_ch, out_ch],
                initializer=tf.truncated_normal_initializer(0.0, stddev=0.01))
        b_conv = tf.get_variable('biases', [out_ch],
                initializer=tf.constant_initializer(0.0))

        # Explicit zero padding on h and w, then a VALID convolution
        padded_input = tf.pad(prev_layer, [[0, 0], [p_h, p_h], [p_w, p_w], [0, 0]], "CONSTANT") if pad == 'VALID' \
                else prev_layer

        output = tf.nn.conv2d(padded_input, w_conv,
                [1, d_h, d_w, 1], padding=pad, name='z') + b_conv

        return output


def batch_norm(prev_layer, out_ch, eps, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        #mu_conv, var_conv = tf.nn.moments(prev_layer, [0, 1, 2], keep_dims=False)
        mu_conv = tf.get_variable('mean', [out_ch],
            initializer=tf.constant_initializer(0))
        var_conv = tf.get_variable('var', [out_ch],
            initializer=tf.constant_initializer(1))
        gamma_conv = tf.get_variable('gamma', [out_ch],
            initializer=tf.constant_initializer(1))
        beta_conv = tf.get_variable('beta', [out_ch],
            initializer=tf.constant_initializer(0))
        output = tf.nn.batch_normalization(prev_layer, mu_conv,
            var_conv, beta_conv, gamma_conv, eps, name='batch_norm')

        return output


def relu(prev_layer, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        return tf.nn.relu(prev_layer, name='a')


def maxpool(prev_layer, k_h=1, k_w=1, d_h=1, d_w=1, name_scope='conv'):
    with tf.variable_scope(name_scope) as scope:
        return tf.nn.max_pool(prev_layer,
                [1, k_h, k_w, 1], [1, d_h, d_w, 1], padding='VALID', name='maxpool')
================================================
FILE: util.py
================================================
import numpy as np
import librosa
import pdb
local_config = {
'batch_size': 64,
'load_size': 22050*20,
'phase': 'extract'
}
def load_from_list(name_list, config=local_config):
    assert len(name_list) == config['batch_size'], \
            "The length of name_list({})[{}] is not the same as batch_size[{}]".format(
                    name_list[0], len(name_list), config['batch_size'])
    audios = np.zeros([config['batch_size'], config['load_size'], 1, 1])
    for idx, audio_path in enumerate(name_list):
        sound_sample, _ = load_audio(audio_path)
        audios[idx] = preprocess(sound_sample, config)

    return audios

def load_from_txt(txt_name, config=local_config):
    with open(txt_name, 'r') as handle:
        txt_list = handle.read().splitlines()

    audios = []
    for idx, audio_path in enumerate(txt_list):
        sound_sample, _ = load_audio(audio_path)
        audios.append(preprocess(sound_sample, config))

    return audios

# NOTE: Load audio in the same format as SoundNet
# 1. Keep original sample rate (which conflicts with their own paper)
# 2. Use first channel in multiple channels
# 3. Keep range in [-256, 256]
def load_audio(audio_path, sr=None):
    # librosa resamples to 22050 Hz by default; passing sr=None keeps the file's
    # native sample rate. Samples are floats in the range [-1., 1.]
    sound_sample, sr = librosa.load(audio_path, sr=sr, mono=False)

    return sound_sample, sr

def preprocess(raw_audio, config=local_config):
    # Select first channel (mono)
    if len(raw_audio.shape) > 1:
        raw_audio = raw_audio[0]

    # Make range [-256, 256]
    raw_audio *= 256.0

    # Make minimum length available
    length = config['load_size']
    if length > raw_audio.shape[0]:
        # Integer division so np.tile receives an int repeat count
        raw_audio = np.tile(raw_audio, length // raw_audio.shape[0] + 1)

    # Make equal training length
    if config['phase'] != 'extract':
        raw_audio = raw_audio[:length]

    # Check conditions
    assert len(raw_audio.shape) == 1, "It seems this audio contains two channels, we only need the first channel"
    assert np.max(raw_audio) <= 256, "It seems this audio contains signal that exceeds 256"
    assert np.min(raw_audio) >= -256, "It seems this audio contains signal that falls below -256"

    # Shape to 1 x DIM x 1 x 1
    raw_audio = np.reshape(raw_audio, [1, -1, 1, 1])

    return raw_audio.copy()
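
# Sketch (not part of util.py): the length handling in preprocess, isolated as a
# NumPy-only example that needs neither librosa nor an audio file. The names
# tile_to_min_length and to_model_input are hypothetical, and the tiny input
# signal is made up for illustration; treat this as a minimal sketch rather
# than the repository's implementation.

```python
import numpy as np


def tile_to_min_length(signal, min_length):
    """Repeat a 1-D signal until it has at least min_length samples."""
    n = signal.shape[0]
    if n >= min_length:
        return signal
    # Integer division: np.tile needs an integer repeat count
    repeats = min_length // n + 1
    return np.tile(signal, repeats)


def to_model_input(signal, min_length, truncate=True, scale=256.0):
    """Scale a [-1, 1] signal, extend it to min_length, reshape to 1 x DIM x 1 x 1."""
    scaled = signal.astype(np.float32) * scale
    extended = tile_to_min_length(scaled, min_length)
    if truncate:
        # Mirrors the non-'extract' phases, which cut to a fixed length
        extended = extended[:min_length]
    return extended.reshape(1, -1, 1, 1)


if __name__ == "__main__":
    demo = np.linspace(-1.0, 1.0, 5)  # made-up 5-sample signal
    print(to_model_input(demo, min_length=12).shape)
    print(to_model_input(demo, min_length=12, truncate=False).shape)
```

# With truncate=False the output keeps every tiled sample, which corresponds to
# the 'extract' phase where inputs are not cut to load_size.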
SYMBOL INDEX (31 symbols across 6 files)
FILE: extract_feat.py
function parse_args (line 26) | def parse_args():
function extract_feat (line 51) | def extract_feat(model, sound_input, config):
FILE: load_t7.py
function load (line 18) | def load(o, param_list):
function show (line 53) | def show(o):
function get_mylist (line 76) | def get_mylist():
FILE: main.py
class Model (line 44) | class Model():
method __init__ (line 45) | def __init__(self, session, config=local_config, param_G=None):
method model (line 54) | def model(self):
method add_generator (line 99) | def add_generator(self, name_scope='SoundNet'):
method train (line 142) | def train(self):
method KL_divergence (line 177) | def KL_divergence(self, dist_a, dist_b, name_scope='KL_Div'):
method get_model_dir (line 185) | def get_model_dir(self):
method load (line 193) | def load(self, ckpt_dir='checkpoint'):
method save (line 198) | def save(self, checkpoint_dir, step):
method load_from_ckpt (line 211) | def load_from_ckpt(self, checkpoint_dir='checkpoint'):
method load_from_npy (line 230) | def load_from_npy(self):
function main (line 247) | def main():
function parse_args (line 291) | def parse_args():
FILE: model.py
class Model (line 20) | class Model():
method __init__ (line 21) | def __init__(self, session, config=local_config, param_G=None):
method add_placeholders (line 36) | def add_placeholders(self):
method add_generator (line 41) | def add_generator(self, name_scope='SoundNet'):
method load (line 82) | def load(self):
FILE: ops.py
function conv2d (line 4) | def conv2d(prev_layer, in_ch, out_ch, k_h=1, k_w=1, d_h=1, d_w=1, p_h=0,...
function batch_norm (line 21) | def batch_norm(prev_layer, out_ch, eps, name_scope='conv'):
function relu (line 38) | def relu(prev_layer, name_scope='conv'):
function maxpool (line 43) | def maxpool(prev_layer, k_h=1, k_w=1, d_h=1, d_w=1, name_scope='conv'):
FILE: util.py
function load_from_list (line 12) | def load_from_list(name_list, config=local_config):
function load_from_txt (line 24) | def load_from_txt(txt_name, config=local_config):
function load_audio (line 41) | def load_audio(audio_path, sr=None):
function preprocess (line 48) | def preprocess(raw_audio, config=local_config):